© USO—iStock/Getty Images

1. Introduction

1.1 Background and Motivation

Initially, we wanted to analyse the impact of the recent global pandemic on individuals in the context of mental health conditions such as anxiety and depression, in a pre- and post-COVID framework. However, we dropped the idea because too little data was available on the topic to support a thorough analysis. While searching for a new topic, we found a dataset on shark attacks around the world. A group member then told us about an incident that occurred during his last holidays in Egypt: a shark attack at a paradisiac beach. We went on to discuss the presence of these predators in many regions of the world, including Florida, Hawaii, and Australia, to name just a few. For this reason, we decided to dive deeper into the topic of shark attacks and explore potential influencing factors. After a few days of research and information gathering, we found that a wide range of factors can influence such incidents, both directly and indirectly. We all enjoy holidays at the beach, and when temperatures rise we go into the water to cool off; depending on the location, however, this can be unsafe.

Another motivation behind this project is our interest in the preservation of marine fauna. Human beings tend to primarily blame the animals for their fierce behavior (consider the attack that occurred in Hurghada in the summer of 2023: after the death of the victim, the shark was killed!), but we often forget that the water is their natural habitat, and that we are the guests. This motivates us even more to study the phenomenon of shark attacks and to show how human activity contributes to them.

1.2 Project Objective

The project’s focus is to provide an understanding of the complex dynamics of shark attacks globally, by conducting an in-depth analysis of the different factors that influence the phenomenon. In addressing the topic at hand, we explore both direct and indirect influencing factors.

At the end of our work, we want to understand how accidents caused by sharks can be reduced. On the one hand, this is crucial to guarantee the safety of people who enjoy coastal waters and engage in water-related activities. On the other hand, the conservation of aquatic animals might benefit from it. Making efforts to preserve these species is essential for sustaining the health of marine ecosystems at a time when the extinction rate of many animal species is rising. Among the factors that might affect the trend of shark attacks are climatic variations. Ocean temperatures and currents can affect migration patterns and habitat preferences, which further complicates the dynamics of shark attacks. By providing a step-by-step analysis of climate-related aspects, we aim to supply empirical evidence on the impact of environmental changes on shark movements and interactions with humans.

This project’s goal is to contribute to the understanding of shark-related incidents and thereby to provide possible strategies such as policymaking, which would help mitigate such occurrences, ultimately safeguarding human life and protecting sharks. In essence, we aim to advance research on sustainable coexistence with the marine environment through actionable insights and intend to contribute to the broader efforts of marine conservation.

1.3 Research Questions

  1. What are the primary factors influencing the occurrence of shark attacks?
  2. Is it true that shark attacks have been increasing with climate change?
  3. What are the main factors leading to the fatality of a shark attack?

2. Data Sets

In this second section, we present the datasets needed to answer our research questions. These datasets serve as the foundation for our investigation and provide the data required to draw insightful and important findings. The purpose of this part is to give the reader a clear overview of the data landscape: what information is to be found in each dataset, and how the datasets interrelate.

2.1 Shark Attacks Data Set

Our first dataset contains information on over 6,000 shark attacks that have occurred worldwide over the past centuries. The dataset was obtained from the website of the Shark Research Institute (SRI), which carries out and funds rigorous, peer-reviewed shark field research and educates the public using scientific data.

Even though the dataset is owned by the SRI, we downloaded it through the following platform: https://www.kaggle.com/datasets/teajay/global-shark-attacks.


# Create a two-column table: the first column holds the variable names, the second a description of each variable

Variables <- c("Case Number", "Date", "Year", "Type", "Country", "Area", "Location", "Activity", "Name", "Sex", "Age", "Injury", "Fatal", "Time", "Species", "Investigator.or.Source", "pdf", "href.formula", "Case Number 2")
Description <- c("Case number of the accident", "Date of the accident", "Year of the accident", "Whether the victim provoked the shark or not", "Country of the accident", "Region", "Location", "Activity the victim was doing before the accident", "Name of the victim", "Sex of the victim", "Age of the victim", "Injury provoked by the accident", "Whether the accident led to death or not (binary variable)", "Time of the accident", "Shark species", "Investigator of the accident", "PDF information on the accident", "PDF link of the accident", "Case number (same as the first one)")

Table1 <- data.frame(Variable = Variables, Description = Description)

# knitr::kable() renders the table
knitr::kable(Table1)
|Variable               |Description                                                 |
|:----------------------|:-----------------------------------------------------------|
|Case Number            |Case number of the accident                                  |
|Date                   |Date of the accident                                         |
|Year                   |Year of the accident                                         |
|Type                   |Whether the victim provoked the shark or not                 |
|Country                |Country of the accident                                      |
|Area                   |Region                                                       |
|Location               |Location                                                     |
|Activity               |Activity the victim was doing before the accident            |
|Name                   |Name of the victim                                           |
|Sex                    |Sex of the victim                                            |
|Age                    |Age of the victim                                            |
|Injury                 |Injury provoked by the accident                              |
|Fatal                  |Whether the accident led to death or not (binary variable)   |
|Time                   |Time of the accident                                         |
|Species                |Shark species                                                |
|Investigator.or.Source |Investigator of the accident                                 |
|pdf                    |PDF information on the accident                              |
|href.formula           |PDF link of the accident                                     |
|Case Number 2          |Case number (same as the first one)                          |

2.2 Temperatures Data Set

Our second dataset contains the average temperature change for each country (annually, monthly and seasonally) from 1961 to 2022. It contains 9,657 observations.

The dataset was taken from the website of the Food and Agriculture Organization of the United Nations: https://www.fao.org/faostat/en/#data/ET


# Create a two-column table: the first column holds the variable names, the second a description of each variable

Variables <- c("Area Code", "Area", "Months", "Month code", "Element Code", "Element", "Unit", "Years")
Description <- c("Code for each country", "Country for which the data has been collected", "Month during which the temperature data has been collected", "Code of each month", "Takes one of two values (7271: temperature change; 6078: standard deviation)", "What the data is measuring (either temperature change or the standard deviation of the data collection)", "Unit of measure of temperature", "Year of the observations")

Table2 <- data.frame(Variable = Variables, Description = Description)

# knitr::kable() renders the table
knitr::kable(Table2)
|Variable     |Description                                                                                             |
|:------------|:--------------------------------------------------------------------------------------------------------|
|Area Code    |Code for each country                                                                                      |
|Area         |Country for which the data has been collected                                                              |
|Months       |Month during which the temperature data has been collected                                                 |
|Month code   |Code of each month                                                                                         |
|Element Code |Takes one of two values (7271: temperature change; 6078: standard deviation)                               |
|Element      |What the data is measuring (either temperature change or the standard deviation of the data collection)    |
|Unit         |Unit of measure of temperature                                                                             |
|Years        |Year of the observations                                                                                   |
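When this dataset is used later, only the temperature-change rows are of interest, so the standard-deviation rows can be dropped. Below is a minimal sketch of that filter, assuming the column names from the table above; the toy rows are invented for illustration.

```r
library(dplyr)

# Toy rows mimicking the structure described above (invented values)
toy <- data.frame(
  Area           = c("Italy", "Italy"),
  `Element Code` = c(7271, 6078),  # 7271 = temperature change, 6078 = standard deviation
  check.names    = FALSE
)

# Keep only the temperature-change observations
temp_change <- filter(toy, `Element Code` == 7271)
```

The backticks around `Element Code` are needed because the column name contains a space.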

2.3 Sea Level Data Set

Our third dataset contains information on the global increase in sea level from 1993 to 2021. The following acronyms are important to note: GMSL: Global Mean Sea Level; GIA: Glacial Isostatic Adjustment, which is the ongoing movement of land once burdened by ice-age glaciers (National Ocean Service, n.d.).

The dataset was taken from the NASA website: https://climate.nasa.gov/vital-signs/sea-level/


# Create a two-column table: the first column holds the variable names, the second a description of each variable

Variables <- c("Year", "TotalWeightedObservations", "GMSL_noGIA", "StdDevGMSL_noGIA", "SmoothedGSML_noGIA", "GMSL_GIA", "StdDevGMSL_GIA", "SmoothedGSML_GIA", "SmoothedGSML_GIA_sigremoved")
Description <- c("Year of the observation", "Total weighted observations of sea level", "Global mean sea level variation in millimeters, with respect to the 20-year TOPEX/Jason collinear mean reference (without the GIA applied)", "Standard deviation of the global mean sea level variation estimate in millimeters (without the GIA applied)", "Smoothed (60-day Gaussian-type filter) GMSL variation in millimeters (without the GIA applied)", "Global mean sea level variation in millimeters, with respect to the 20-year TOPEX/Jason collinear mean reference (with the GIA applied)", "Standard deviation of the global mean sea level variation estimate in millimeters (with the GIA applied)", "Smoothed (60-day Gaussian-type filter) GMSL variation in millimeters (with the GIA applied)", "Smoothed (60-day Gaussian-type filter) GMSL variation in millimeters (with the GIA applied); annual and semi-annual signal removed")

Table3 <- data.frame(Variable = Variables, Description = Description)

# knitr::kable() renders the table
knitr::kable(Table3)
|Variable                    |Description                                                                                                                               |
|:---------------------------|:------------------------------------------------------------------------------------------------------------------------------------------|
|Year                        |Year of the observation                                                                                                                     |
|TotalWeightedObservations   |Total weighted observations of sea level                                                                                                    |
|GMSL_noGIA                  |Global mean sea level variation in millimeters, with respect to the 20-year TOPEX/Jason collinear mean reference (without the GIA applied)   |
|StdDevGMSL_noGIA            |Standard deviation of the global mean sea level variation estimate in millimeters (without the GIA applied)                                 |
|SmoothedGSML_noGIA          |Smoothed (60-day Gaussian-type filter) GMSL variation in millimeters (without the GIA applied)                                              |
|GMSL_GIA                    |Global mean sea level variation in millimeters, with respect to the 20-year TOPEX/Jason collinear mean reference (with the GIA applied)     |
|StdDevGMSL_GIA              |Standard deviation of the global mean sea level variation estimate in millimeters (with the GIA applied)                                    |
|SmoothedGSML_GIA            |Smoothed (60-day Gaussian-type filter) GMSL variation in millimeters (with the GIA applied)                                                 |
|SmoothedGSML_GIA_sigremoved |Smoothed (60-day Gaussian-type filter) GMSL variation in millimeters (with the GIA applied); annual and semi-annual signal removed          |

2.4 Greenhouse Gas Emissions Data Set

Our fourth dataset relates to greenhouse gas (GHG) emissions, one of the main concerns linked to global warming. The dataset contains 31,315 observations on annual GHG emissions per country and per continent, but we will focus on the per-country data only.

The dataset was taken from the website Our World in Data: https://ourworldindata.org/annual-co2-emissions


# Create a two-column table: the first column holds the variable names, the second a description of each variable

Variables <- c("Entity", "Code", "Year", "Annual CO₂ emissions")
Description <- c("Country examined", "ISO code", "Year of the observation", "GHG emissions per year")

Table4 <- data.frame(Variable = Variables, Description = Description)

# knitr::kable() renders the table
knitr::kable(Table4)
|Variable             |Description             |
|:--------------------|:-----------------------|
|Entity               |Country examined        |
|Code                 |ISO code                |
|Year                 |Year of the observation |
|Annual CO₂ emissions |GHG emissions per year  |
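Since we focus on the per-country rows only, the aggregate rows must be dropped. In this export we assume that aggregates such as continents and the world have either an empty ISO code or one prefixed with "OWID_" (an assumption about this particular file, to be checked against the raw data); a minimal sketch with invented toy rows:

```r
library(dplyr)

# Toy rows mimicking the structure above (values invented for illustration)
toy <- data.frame(
  Entity = c("Italy", "Europe", "World"),
  Code   = c("ITA", "", "OWID_WRL")  # assumed coding of aggregate rows
)

# Keep country rows: a real ISO code, not an empty or "OWID_" aggregate code
countries_only <- filter(toy, Code != "" & !startsWith(Code, "OWID_"))
```

Here `countries_only` retains only the Italy row.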

3. Data Wrangling

The first step of our work is data cleaning. When data are collected for a study, some variables may contain missing information, worthless entries, or spelling mistakes. In order to run a regression, it is fundamental that each column within a dataset has no missing values and that information follows a consistent format. This procedure is called data cleaning, and it covers the whole of this subsection.

Please be aware that at the beginning of each subsection we only provide general instructions for our cleaning. In the code, you will find additional in-depth explanations of each step we took.

3.1 Attacks Cleaning

The first dataset we examine is the one pertaining to shark attacks. It was riddled with misspellings and missing values, which required a lot of time and fairly dense code to clean.

For our work, we have decided to study the phenomenon of shark attacks from 1970 onward only. In the raw dataset it is possible to observe that, before this date, observations contain more erroneous information. This suggests that the information we need is more accurate if we do not travel too far back in time.
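A minimal sketch of this cut-off, assuming the year column is numeric and named Year as in the variable table above (the toy data frame is invented for illustration):

```r
library(dplyr)

# Toy rows standing in for the attacks dataset (invented values)
toy_attacks <- data.frame(Year = c(1950, 1969, 1970, 2001))

# Keep only attacks recorded from 1970 onward
toy_attacks <- filter(toy_attacks, Year >= 1970)
```

After the filter, only the 1970 and 2001 rows remain.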

While cleaning the Age column, we realized there were too many NAs to simply delete them all, as that would have meant losing 29.5% of our data. Therefore, we decided to proceed as follows: for each country that has fewer NAs than total observations, we compute a histogram (age against frequency) of the non-missing data. If the distribution is roughly normal, we substitute the NAs with the mean; if it is left-skewed, with the first quartile; if it is right-skewed, with the third quartile.

Here is an example with the USA and the Bahamas. For the USA, the distribution of the non-missing victims’ ages is left-skewed, so we substitute all missing values with the first quartile. For the Bahamas, the data clearly follow a normal distribution, so we replace the NAs with the mean.
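This rule could also be automated instead of eyeballing each histogram. Below is a sketch of a hypothetical helper that computes the sample skewness and picks the replacement value accordingly; the helper name and the ±0.5 skewness threshold are our own assumptions, not part of the original workflow.

```r
# Hypothetical helper: choose the imputation value for one country's ages
# following the rule above (mean if roughly symmetric, first quartile if
# left-skewed, third quartile if right-skewed).
impute_value <- function(ages, threshold = 0.5) {
  ages <- as.numeric(na.omit(ages))
  # Sample skewness: third central moment scaled by the cubed standard deviation
  skew <- mean((ages - mean(ages))^3) / sd(ages)^3
  if (skew < -threshold) {
    unname(quantile(ages, 0.25))   # left-skewed -> first quartile
  } else if (skew > threshold) {
    unname(quantile(ages, 0.75))   # right-skewed -> third quartile
  } else {
    mean(ages)                     # roughly normal -> mean
  }
}
```

For a symmetric sample such as c(20, 30, 40) the helper returns the mean, 30.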

#AGE IMPUTATION PER COUNTRY#

# The same steps are repeated for every country with many missing ages, so we
# run them in a single loop. For each of the countries below, the histogram of
# non-missing ages (inspected interactively, which is why hist() is commented
# out) justified imputing with the country's mean.

countries_to_impute <- c("EGYPT", "ITALY", "AUSTRALIA", "SOUTH AFRICA", "BRAZIL",
                         "FIJI", "FRENCH POLYNESIA", "HONG KONG", "JAPAN",
                         "MEXICO", "MOZAMBIQUE", "NEW ZEALAND", "PHILIPPINES",
                         "REUNION", "NEW CALEDONIA")

for (ctry in countries_to_impute) {
  # Extract the non-missing ages recorded for this country
  ages_without_na <- as.numeric(na.omit(attacks$Age[attacks$Country == ctry]))
  # hist(ages_without_na) -> commented out so a histogram is not displayed for every country
  mean_age <- mean(ages_without_na)
  # Replace the country's NAs with the country's mean age
  attacks$Age[attacks$Country == ctry & is.na(attacks$Age)] <- mean_age
}


#We fixed the countries that had many NAs. For the remaining NAs we decided to do nothing, for a specific reason: not only could we not find the real information on the internet, but we also believe that taking the mean in these remaining cases would be useless, because these are the cases where the NAs equal the total number of attacks (see Antigua, with 1 shark attack and 1 NA) or exceed half of them (see Malaysia, with 4 total attacks of which 3 are NA).


count_na_by_country <- attacks %>%
  group_by(Country) %>%
  summarise(NA_count = sum(is.na(Age)))


attacks <- subset(attacks, !is.na(Age))

#___________________________________________________________________________________________________________________


#NOW WE WORK ON TIME

#Some hours are written in a different format from most others, which follow the
#format 13h30. Since there are not many of them, we replace them manually.

# Map each irregular value to its replacement: time ranges become their midpoint,
# and non-time notes become empty strings
time_fixes <- c(
  "Possibly same incident as 2000.08.21" = "",
  "14h00  -15h00"                        = "14h30",
  "14h30 / 15h30"                        = "15h00",
  "09h30 / 10h00"                        = "9h45",
  "10h45-11h15"                          = "11h00",
  "Sometime between 06h00 & 08hoo"       = "7h00",
  "18h15-18h30"                          = "18h20",
  "09h00 - 09h30"                        = "9h15",
  "10h00 -- 11h00"                       = "10h30",
  "11h115"                               = "11h15",
  "Between 05h00 and 08h00"              = "6h30",
  "17h00 or 17h40"                       = "17h20",
  "13h345"                               = "13h45",
  "<a0> "                                = "",
  "09h00 -10h00"                         = "9h30",
  "2 hours after Opperman"               = "",
  "11h00 / 11h30"                        = "11h15",
  "Between 06h00 & 07h20"                = "6h40",
  "30 minutes after 1992.07.08.a"        = ""
)

# Apply every fix in one pass: look the value up in the map, otherwise keep it
attacks <- mutate(attacks, Time = ifelse(Time %in% names(time_fixes),
                                         unname(time_fixes[Time]), Time))



#Our objective now is to classify hours into parts of the day. It is useless to keep
#exact hours for a regression, because we want times like 7h30 and 7h00, or 15h00 and
#16h00, to be read as the same thing. Some rows already contained the part of the day,
#so, to work uniformly with the whole column, we replace "morning" with 8h00 and so on.
#This way we can remove all the useless strings; finally, empty rows are replaced by NA
#and the hours, stored as character, are converted to numeric.

attacks$Time <- str_replace_all(attacks$Time, "\\bmorning\\b", "8h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bafternoon\\b", "15h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bevening\\b", "20h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bnight\\b", "23h00")

attacks$Time <- gsub("[^[:digit:]]", "", attacks$Time)
attacks$Time = na_if(attacks$Time, "")
attacks$Time <- as.numeric(attacks$Time)

# Since we removed all the letters, hours are no longer written as 8h00 or 15h30 but as
# 800 and 1530. This is convenient: we can easily create a function that classifies these
# numbers into parts of the day. The function works this way: if a value falls between
# 0 and 500 (i.e. midnight and 5 a.m.), it is replaced by the word "night", and so on.

timeoftheday <- function(time) {
  if (!is.na(time)) {
    if (time >= 0 && time < 500) {
      return("night")
    } else if (time >= 500 && time < 1200) {
      return("morning")
    } else if (time >= 1200 && time < 1730) {
      return("afternoon")
    } else if (time >= 1730 && time < 2400) {
      return("evening")
    } else {
      return("")  # Handle any other cases (if necessary)
    }
  } else {
    return(NA)  # Retain NA values
  }
}

attacks$Time <- sapply(attacks$Time, timeoftheday)
attacks$Time <- tolower(attacks$Time)
attacks$Time <- na_if(attacks$Time, "")


sum(is.na(attacks$Time)) #Time is the only column that still presents 1415 NAs. We do not
#want to delete them all because we would lose 44% of our data.
table(attacks$Time) #The table shows that most attacks happen in the afternoon, with the
#morning in second place. The higher frequency of attacks between 8am and 6pm is confirmed
#by this article (link JC found??). We explain it by the fact that, naturally, most people
#swim during the day. Therefore, we replace the NAs randomly, according to the proportions
#of afternoon, morning, evening and night in the data.

sum(!is.na(attacks$Time))

#Substitute the NAs in Time with one of the 4 parts of the day, drawn at random according
#to their proportions in our dataset (a set.seed() call before sampling would make this
#step reproducible)

non_na_time_proportions <- table(attacks$Time) / sum(!is.na(attacks$Time))


# Identify NA positions
position_of_na <- is.na(attacks$Time)

# Generate random values based on proportions
attacks$Time[position_of_na] <- sample(
  names(non_na_time_proportions),
  sum(position_of_na),
  replace = TRUE,
  prob = as.vector(non_na_time_proportions)
)


#_____________________________________________________________________________________________
#NOW WE WORK ON SEX

attacks$Sex <- gsub("lli", "", attacks$Sex)
attacks$Sex <- gsub("M ", "M", attacks$Sex)
attacks$Sex <- na_if(attacks$Sex, "")
attacks$Sex <- ifelse(is.na(attacks$Sex), "Unknown", attacks$Sex)

#__________________________________________________________________________________________________________________


#NOW WE WORK ON SHARK SPECIES


#Creation of a variable species where we delete all the numbers that concern the size of the shark
attacks$Species <- str_remove_all(attacks$Species, "[[:digit:]]")

#If there is an empty cell, we will put NA
species = na_if(attacks$Species, "")

#gsub to take off punctuation
species <- gsub("[[:punct:]]", "", species)

#Delete all the numbers followed by cm or m
species <- gsub("\\d+\\s*(cm|m)\\b", "", species)

#Delete words composed of 1 or 2 letters
species <- gsub("\\b\\w{1,2}\\b", "", species)

#Delete all the numbers
species <- gsub("\\d", "", species)

#Delete all units such as kg or lb
species <- gsub("\\b(kg|lb|b)\\b", "", species)

#If the word shark appears more than once in each cell, we only write shark once
species <- gsub("\\bshark\\b(.*\\bshark\\b)?", "shark", species, ignore.case = TRUE)


# Rewrite shark species names in one way, and do that for each species
species<- gsub("^.*White shark.*$", "White shark", species)
species<- gsub("^.*whitetip shark.*$", "White shark", species)
species<- gsub("^.*white shark.*$", "White shark", species)
species<- gsub("^.*Wobbegong shark.*$", "Wobbegong shark", species)
species<- gsub("^.*Wobbegong.*$", "Wobbegong shark", species)
species<- gsub("^.*Zambesi shark.*$", "Zambesi shark", species)
species<- gsub("^.*Zambezi shark.*$", "Zambesi shark", species)
species<- gsub("^.*whaler shark.*$", "Whale shark", species)
species<- gsub("^.*whale shark.*$", "Whale shark", species)
species<- gsub("^.*tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*Tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*thresher shark.*$", "Thresher shark", species)
species<- gsub("^.*spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark feet.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*spinner  blacktip shark.*$", "Spinner shark", species)
species<- gsub("^.*Tawny nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Mako shark.*$", "Mako shark", species)
species<- gsub("^.*mako shark.*$", "Mako shark", species)
species<- gsub("^.*Lemon shark.*$", "Lemon shark", species)
species<- gsub(".*lemon shark.*", "Lemon shark", species, ignore.case = TRUE)
species<-gsub("^\\s*shark\\s*$", "Unidentified shark", species)
species<- gsub(".*bull.*", "Bull shark", species, ignore.case = TRUE)
species<- gsub(".*blue.*", "Blue shark", species, ignore.case = TRUE)
species<- gsub(".*reef.*", "Reef shark", species, ignore.case = TRUE)
species<- gsub(".*sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("^.*Sand shark.*$", "Sand shark", species)
species<- gsub(".*Sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*sandshark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*Sandbar shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("juvenile\\s+\\w+", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark  blacktip shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("blacktip\\s+\\w+", "Blacktip shark",species, ignore.case = TRUE)
species <- gsub("\\bblack\\w*\\b", "Blacktip shark", species, ignore.case = TRUE)

#Replace all occurrences of the word "blacktip" in a column with "Blacktip Shark," even if the word is accompanied by other words before or after

species <- gsub("\\bblacktip\\b", "Blacktip Shark", species, ignore.case = TRUE)
species<- gsub(".*copper shark*", "Copper shark", species, ignore.case = TRUE)
species <- gsub("\\bcow\\b", "Cow", species)
species <- gsub("\\bsilky\\b", "Silky", species)
species <- gsub("\\bsilvertip\\b", "Silvertip", species)


#Condition: if "Hammerhead" followed by other words then replaced by Hammerhead shark
species <- ifelse(grepl("Hammerhead\\s+\\w+", species, ignore.case = TRUE), "Hammerhead shark", species)
species <- ifelse(grepl("Blacktip\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Raggedtooth\\s+\\w+", species, ignore.case = TRUE), "Raggedtooth shark", species)
species <- ifelse(grepl("Porbeagle\\s+\\w+", species, ignore.case = TRUE), "Porbeagle shark", species)


#if the word shark does not appear then NA
species <- ifelse(grepl("shark", species, ignore.case = TRUE), species, NA)
species <- ifelse(grepl("Juvenile\\s+\\w+", species, ignore.case = TRUE), "Juvenile shark", species)
species <- ifelse(grepl("grey\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("greycolored\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("gray\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("Broadnose\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("7gill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("sevengill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("black\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Sand\\s+\\w+", species, ignore.case = TRUE), "Sand shark", species)
species <- ifelse(grepl("Carpet\\s+\\w+", species, ignore.case = TRUE), "Carpet shark", species)
species <- ifelse(grepl("\\bbrown\\b", species, ignore.case = TRUE), "Brown shark", species)

species <- gsub("^frag\\w+", "Unidentified shark", species)
species <- gsub("^shark$", "Unidentified shark", species)
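The long chain of gsub() calls above can also be driven from a single named lookup vector, which is easier to extend. A minimal sketch on a hypothetical sample vector (the three patterns shown are illustrative, not the full set used above):

```r
# Hypothetical sample; the real cleaning runs on the full species column
sample_species <- c("blacktip reef", "a copper shark, 2 m", "shark")

# Named vector: regex pattern -> canonical label
replacements <- c(
  "\\bblacktip\\b.*" = "Blacktip shark",
  ".*copper shark.*" = "Copper shark",
  "^shark$"          = "Unidentified shark"
)

# Apply each substitution in order, case-insensitively
for (pattern in names(replacements)) {
  sample_species <- gsub(pattern, replacements[[pattern]],
                         sample_species, ignore.case = TRUE)
}
sample_species
# "Blacktip shark" "Copper shark" "Unidentified shark"
```

Because the patterns live in one place, adding a new species label is a one-line change.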

# Replace cell with "Unidentified shark" if "small" is present (either case)
species <- ifelse(grepl("small", species, ignore.case = TRUE), "Unidentified shark", species)

# Replace cell with "Unidentified shark" if "sharks" is present
species <- ifelse(grepl("sharks", species), "Unidentified shark", species)

# Replace cell with "Cookiecutter shark" if "cookie" is present
species <- ifelse(grepl("cookie", species), "Cookiecutter shark", species)

# Replace cell with "No shark" if "involvement" (or the "invovlement" typo) is present
species <- ifelse(grepl("involvement|invovlement", species), "No shark", species)

# Replace cell with "No shark" if "not" is present (either case)
species <- ifelse(grepl("not", species, ignore.case = TRUE), "No shark", species)

# Replace cell with "No shark" if "questionable" is present (either case)
species <- ifelse(grepl("questionable", species, ignore.case = TRUE), "No shark", species)

# Replace cell with "Salmon shark" if "Salmon" is present
species <- ifelse(grepl("Salmon", species), "Salmon shark", species)

# Replace cell with "Whale shark" if "whaler" is present
species <- ifelse(grepl("whaler", species), "Whale shark", species)


# Replace the cell with "Unidentified shark" if any of the specified words are found
species <- ifelse(grepl("(seen|observed|Tooth|tooth|large|killed|captive|female|metre|foot|followed|caused|Said|young|probably|gaffed)", species, ignore.case = TRUE), "Unidentified shark", species)

# Replace the cell with "No shark" if any of the specified words are found
species <- ifelse(grepl("(hoax|No Shark)", species, ignore.case = TRUE), "No shark", species)

# Replace the cell with "Copper shark" if any of the specified words are found
species <- ifelse(grepl("(Copper)", species, ignore.case = TRUE), "Copper shark", species)

# Replace the cell with "Dogfish shark" if any of the specified words are found
species <- ifelse(grepl("(Dog|dogfish)", species, ignore.case = TRUE), "Dogfish shark", species)

# Replace the cell with "Dusky shark" if any of the specified words are found
species <- ifelse(grepl("(Dusky)", species, ignore.case = TRUE), "Dusky shark", species)

# Replace the cell with "Sevengill shark" if any of the specified words are found
species <- ifelse(grepl("(gill)", species, ignore.case = TRUE), "Sevengill shark", species)

# Replace the cell with "Angel shark" if any of the specified words are found
species <- ifelse(grepl("(Angel)", species, ignore.case = TRUE), "Angel shark", species)


attacks$Species <- species
attacks$Species <- ifelse(is.na(attacks$Species), "Unidentified shark", attacks$Species)

#_____________________________________________________________________________________________

#ACTIVITY

#Converting this column to lowercase initially failed with: Errore in tolower(attacks$Activity): multibyte string 921 not valid. We therefore first converted everything to ASCII (American Standard Code for Information Interchange)

attacks$Activity <- iconv(attacks$Activity, to = "ASCII", sub = " ")
attacks$Activity <- gsub("[^ -~]", "", attacks$Activity)

attacks$Activity <- tolower(attacks$Activity)
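A small illustration of this encoding step on an invented accented string: non-ASCII characters are first substituted, then any remaining non-printable characters stripped, after which tolower() works safely.

```r
# Hypothetical accented input; the real data comes from attacks$Activity
x <- "P\u00eache \u00e0 la ligne"

x <- iconv(x, to = "ASCII", sub = " ")  # replace unconvertible characters with a space
x <- gsub("[^ -~]", "", x)              # drop anything still outside printable ASCII
x <- tolower(x)                         # now tolower() no longer errors

grepl("[^ -~]", x)  # FALSE: only printable ASCII remains
```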

#A table of all activities shows that they can be grouped into standard categories: diving, race, windsurfing, walking, wading, wade fishing, touching, swimming, surfing, surf, standing, spearfishing, snorkeling, scuba diving, playing, paddle, murder, kayaking, kayak, floating, fishing.

attacks$Activity <- gsub("[^A-Za-z ]", "", attacks$Activity) 


kept_activities <- c("diving", "race", "windsurfing", "walking", "wading", "wade fishing", "touching", "swimming", "surfing", "surf", "standing", "spearfishing", "snorkeling", "scuba diving", "playing", "paddle", "murder", "kayaking", "kayak", "floating", "fishing")


# We create an expression pattern that matches any of the specific words
pattern <- paste(kept_activities, collapse = "|")

# Extract only the specific words (kept_activities) from the column and replace the rest with an empty string
attacks$Activity <- str_extract(attacks$Activity, paste0("\\b(?:", pattern, ")\\b"))

attacks <- attacks %>%
  mutate(Activity = str_replace_all(Activity, "\\bkayaking\\b", "kayak")) %>%
  mutate(Activity = str_replace_all(Activity, "\\bsurfing\\b", "surf"))

attacks$Activity[is.na(attacks$Activity)] <- "other" # for all activities that are not part of the list or for NA, we just put Other, since no information is available
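A toy run of the extraction logic above (the sample strings are invented): str_extract() keeps the first recognised keyword, everything else becomes NA, and NAs are then mapped to "other".

```r
library(stringr)

kept <- c("swimming", "surfing", "surf", "fishing")
pattern <- paste0("\\b(?:", paste(kept, collapse = "|"), ")\\b")

acts <- c("night swimming with friends", "fell off while surfing", "boarding")
acts <- str_extract(tolower(acts), pattern)  # first keyword found, or NA
acts[is.na(acts)] <- "other"                 # unrecognised activities become "other"
acts
# "swimming" "surfing" "other"
```

Note that "boarding" does not match: the \b boundaries prevent partial-word matches, which is why both "surf" and "surfing" need to be in the keyword list.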


#_____________________________________________________________________________________________

#FATAL
#A table of the Fatality column shows that almost all observations follow the same coding (Y = yes, N = no, UNKNOWN). One observation contains "2017", probably a typo (we replace it with an empty string), while another contains "M"; we assume this was a typo as well and that "M" was typed instead of "N".
attacks$Fatal..Y.N. <- gsub("2017", "", attacks$Fatal..Y.N.)
attacks$Fatal..Y.N. <- gsub("M", "N", attacks$Fatal..Y.N.)

attacks$Fatal..Y.N. <-na_if(attacks$Fatal..Y.N., "")

#For the remaining NAs we have no information, so we dropped these few observations.
attacks <- subset(attacks, !is.na(Fatal..Y.N.))
#Replace Yes by 1, 0 otherwise
attacks$Fatal..Y.N. <- ifelse(attacks$Fatal..Y.N. == "Y", 1, 0)

names(attacks)[names(attacks) == "Fatal..Y.N."] <- "Fatality"


#_____________________________________________________________________________________________
#final check up:

#We run a table for every column, in order to check whether any column still has missing values or other mistakes.

table(attacks$Date) #this one is fine
table(attacks$Year) #this one is fine
table(attacks$Type) #here we have boating and boat which mean the same thing. Let us group them
attacks$Type <- gsub("Boating", "Boat", attacks$Type)
attacks$Type <- gsub("Boatomg", "Boat", attacks$Type)
table(attacks$Type) #this one is fine
table(attacks$Country)#this one is fine
table(attacks$Activity)#this one is fine
table(attacks$Age)#this one is fine
table(attacks$Fatality)#this one is fine
table(attacks$Species)#this one is fine


#With the following code, we will add a new column associating each country with its ISO code. This is a very important step for the cleaning in the next sections, because it will allow us to match the different datasets.

# Get ISO country codes
attacks$ISO_Code <- countrycode(attacks$Country, "country.name", "iso3c")

#The following lets us spot NAs: 23 country labels received no ISO code. We fix them by hand.
rows_with_na <- which(is.na(attacks$ISO_Code))

attacks$ISO_Code[attacks$Country %in% c("ENGLAND", "SCOTLAND", "BRITISH ISLES") & is.na(attacks$ISO_Code)] <- "GBR"
attacks$ISO_Code[attacks$Country %in% c("AZORES") & is.na(attacks$ISO_Code)] <- "PRT"
attacks$ISO_Code[attacks$Country %in% c("ST. MAARTIN", "ST. MARTIN") & is.na(attacks$ISO_Code)] <- "MAF"
attacks$ISO_Code[attacks$Country %in% c("OKINAWA") & is.na(attacks$ISO_Code)] <- "JPN"
attacks$ISO_Code[attacks$Country %in% c("MICRONESIA") & is.na(attacks$ISO_Code)] <- "FSM"
attacks <- subset(attacks, !is.na(ISO_Code))
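The manual patching above follows a simple pattern: look up codes, then overwrite the unmatched ones from a hand-made table. A base-R sketch with hypothetical values:

```r
# Hypothetical lookup result with two unmatched country labels
country <- c("UNITED KINGDOM", "ENGLAND", "AZORES", "AUSTRALIA")
iso     <- c("GBR", NA, NA, "AUS")

# Hand-made fixes, as in the assignments above
manual <- c("ENGLAND" = "GBR", "AZORES" = "PRT")

# Patch only the rows that are NA and have a manual fix
fix <- is.na(iso) & country %in% names(manual)
iso[fix] <- manual[country[fix]]
iso
# "GBR" "GBR" "PRT" "AUS"
```

countrycode() also offers a custom_match argument that can express these fixes in the original lookup call, which avoids the follow-up assignments entirely.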

final_attacks_cleaned <- attacks

DT::datatable(attacks, options = list(pageLength = 10))

3.2 Temperatures Cleaning

The main task for this dataset was to change its entire structure. Originally, the years were columns (one column for each year) while the rows were the different countries; the temperature observations were given by month, quarter and year. Furthermore, for each country and each month/quarter/year, the dataset reports both the change in temperature and its standard deviation.

To proceed with the cleaning, we eliminated the rows corresponding to the standard deviation (we will not work with it), and the monthly and quarterly observations: we are only interested in yearly temperature variations.

We then rotated the dataset so as to have all the years under a single column; we subsequently matched each country and year with its temperature change observation. Finally we added the ISO code to merge with the other datasets.


temperature <- read_xlsx(here::here("data/Temperature.xlsx"))

#R did not read the "°C" on the column Unit, so we changed it manually

temperature <- temperature %>% mutate(Unit = "°C")

#We take off all the columns that we do not need for our study

temperature <- temperature %>% select(-'Area Code (M49)', -'Months Code', -'Element Code')

#We eliminate all columns whose names end with an F. For each year there are two columns (e.g. 1961 and 1961F); the F column contains only the letter E in every row. We tried to look up its meaning on the FAO website but did not succeed, and it is in any case unimportant for our work, so we removed it.

columns_to_remove <- grep("F$", names(temperature), value = TRUE)
temperature <- temperature[, !(names(temperature) %in% columns_to_remove)]
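A toy version of this column-dropping step (the column names here are invented):

```r
df <- data.frame(Area = "Italy", Y1961 = 0.1, Y1961F = "E", Y1962 = 0.2)

# Drop every column whose name ends in "F"
df <- df[, !grepl("F$", names(df))]
names(df)
# "Area" "Y1961" "Y1962"
```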

# We keep only meteorological year, we don't need to work with monthly and seasonal data.

target_name <- "Meteorological year"
temperature <- temperature[temperature$Months == target_name, ]

target_name2 <- "Temperature change"
temperature <- temperature[temperature$Element == target_name2, ]



#  We take off the rest of the columns that we do not need

temperature <- temperature %>% select(-'Area Code', -'Months', -'Element', -'Unit')

#Each year column name starts with a Y (Y1961, Y1962, ...). We strip the leading Y, so that we can easily match data with other datasets.

colnames(temperature) <- gsub("^Y", "", colnames(temperature)) # anchored at the start, so a Y elsewhere in a name is untouched
temperature <- na.omit(temperature)


# We transpose the dataset so that columns become rows. With this function we have a single column, where each row corresponds to a year, but countries became the columns now. We will work on this later in the code.
temperature <- t(temperature)

# We want the first row to be the header

colnames(temperature) <- temperature[1, ]
clean_temperature <- temperature[-1, ]

#col names in CAPS
colnames(clean_temperature) <- toupper(colnames(clean_temperature))


library(tidyr)

# Convert the matrix "clean_temperature" to a data frame
temperature <- as.data.frame(clean_temperature)

# Add 'Year' as a separate column
temperature$Year <- rownames(clean_temperature)

# Reshape the data from wide to long format
temperature <- tidyr::pivot_longer(temperature, 
                                     cols = -Year, 
                                     names_to = "Country", 
                                     values_to = "Temperature")

# Reorder columns into the desired format
temperature <- temperature[c("Year", "Country", "Temperature")]

#As we said at the beginning, we only work with data from 1970
temperature <- temperature %>% filter(Year >= 1970)
temperature$Year <- as.numeric(temperature$Year)
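A minimal illustration of the wide-to-long reshape above, on an invented two-country table:

```r
library(tidyr)

wide <- data.frame(Year = c("1970", "1971"),
                   ALBANIA = c(0.1, 0.2),
                   FRANCE  = c(0.3, 0.4))

long <- pivot_longer(wide, cols = -Year,
                     names_to = "Country", values_to = "Temperature")

# One row per Year-Country pair: 2 years x 2 countries = 4 rows
nrow(long)  # 4
```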

temperature <- temperature %>% 
  mutate(Country = ifelse(Country == "UNITED STATES OF AMERICA", "USA", Country))

#As for the attacks dataset, we add a column for the ISO code.

temperature$ISO_Code <- countrycode(temperature$Country, "country.name", "iso3c")
temperature$ISO_Code[temperature$Country %in% c("MICRONESIA") & is.na(temperature$ISO_Code)] <- "FSM"

#Now that we put ISO code, we can remove the country column
temperature <- temperature %>% select(-'Country')

final_temperature_cleaned <- temperature

DT::datatable(temperature, options = list(pageLength = 10))

3.3 Sea Level Cleaning

We chose a dataset on the sea level change observed over recent years. Our analysis of sea level starts in 1993 because we found no data for earlier years. The dataset had few columns, and we kept only the two most relevant for us: the year and GMSL_GIA. The latter is the global mean sea level including the glacial isostatic adjustment (GIA). We chose the column with GIA rather than without, because it accounts for the response of the solid Earth to past changes in surface loading by ice and water, such as glaciation and deglaciation.


sealevel <- read.csv(here::here("data/sealevel.csv"))

# keep the 2 columns we want to analyze; the others are irrelevant for our project
sealevel <- subset(sealevel, select = c(Year, GMSL_GIA))

#Keep the year only on its first occurrence by creating a new column called Year2
sealevel$Year2 <- ifelse(duplicated(sealevel$Year), NA, sealevel$Year)

# delete column Year due to the creation of column Year 2
sealevel <- subset(sealevel, select = -Year)

# delete NA because it does not bring anything to our analysis
sealevel <- na.omit(sealevel)

#Rename Year2 back to Year (a targeted rename: gsub("2", "") would strip a "2" from any column name)
colnames(sealevel)[colnames(sealevel) == "Year2"] <- "Year"

final_sealevel_cleaned <- sealevel
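The Year2 detour above is equivalent to keeping the first row per year; with dplyr this can be written directly (the values below are invented):

```r
library(dplyr)

sealevel_toy <- data.frame(Year = c(1993, 1993, 1994),
                           GMSL_GIA = c(-37.2, -36.9, -35.1))

# distinct() keeps the first observation of each year
sealevel_first <- distinct(sealevel_toy, Year, .keep_all = TRUE)
sealevel_first$GMSL_GIA
# -37.2 -35.1
```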

DT::datatable(sealevel, options = list(pageLength = 10))

3.4 Greenhouse Gas Emissions Cleaning

Finally, the GHG emissions dataset was the easiest to clean. Indeed, we already have a column for the ISO code, so we simply had to filter the years so that the observations began in 1970.

co2 <- read.csv(here::here("data/CO2.csv"))

# Rename columns so that they match those in the other datasets
names(co2)[names(co2) == "Entity"] <- "Country"
names(co2)[names(co2) == "Annual.CO..emissions"] <- "Annual CO2 Emissions"
names(co2)[names(co2) == "Code"] <- "ISO_Code"



# keep the 3 columns we want to analyze
co2 <- subset(co2, select = c("ISO_Code", "Year", "Annual CO2 Emissions"))


# only keep information starting in 1970 because we want to look at the evolution of 
# the last 50 years
co2<- co2[co2$Year >= 1970, ]

co2$ISO_Code <- na_if(co2$ISO_Code, "")

co2 <- subset(co2, !is.na(ISO_Code))

final_ghg_cleaned <- co2


DT::datatable(co2, options = list(pageLength = 10))

3.5 Merged Data Set

After cleaning the four datasets, we finally merged them. This step is essential for running regressions across variables that come from different datasets.



#MERGE FIRST TWO DATA SETS

# 'Year' and 'ISO_Code' are the matching columns in both datasets
merged_data <- left_join(attacks, temperature, by = c("Year", "ISO_Code"))

#MERGE WITH THIRD

# 'Year' is the matching column in both datasets
merged_data2 <- left_join(merged_data, sealevel, by = c("Year"))

#MERGE WITH FOURTH

# 'Year' and 'ISO_Code' are the matching columns in both datasets
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))

#Change name of a column
colnames(merged_data3)[colnames(merged_data3) == 'Country.x'] <- 'Country'
merged_data3$Temperature <- as.numeric(merged_data3$Temperature)
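How these key-based merges behave on a toy pair of tables (the data are invented): rows of the left table with no match in the right table get NA.

```r
library(dplyr)

attacks_toy <- data.frame(Year = c(2000, 2001), ISO_Code = c("USA", "AUS"))
temp_toy    <- data.frame(Year = 2000, ISO_Code = "USA", Temperature = 0.5)

left_join(attacks_toy, temp_toy, by = c("Year", "ISO_Code"))
#   Year ISO_Code Temperature
# 1 2000      USA         0.5
# 2 2001      AUS          NA
```

This is why a left join preserves every attack record even when a covariate is missing for that country-year.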


#merged_data3, which we will use for our regressions, needs numerical variables,
#so we convert some categorical ones

#TYPE

#Transform the Type variable into numeric codes: Boat = 1, Unprovoked = 2, Invalid = 3,
#Provoked = 4, Questionable = 5, Sea Disaster = 6
categories <- c("Boat" = 1, "Unprovoked" = 2, "Invalid" = 3, "Provoked" = 4, "Questionable" = 5, "Sea Disaster" = 6)
merged_data3$Type <- factor(merged_data3$Type, levels = names(categories))
merged_data3$Type <- as.numeric(merged_data3$Type)
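Note how this recoding works: as.numeric() on a factor returns the level index, so the codes come from the order of the levels. A quick check on invented values:

```r
type <- c("Unprovoked", "Boat", "Provoked")
lvls <- c("Boat", "Unprovoked", "Invalid", "Provoked", "Questionable", "Sea Disaster")

# Level positions become the numeric codes
as.numeric(factor(type, levels = lvls))
# 2 1 4
```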

#TIME

# Transform Time into numeric codes; as.numeric() on the factor returns the level
# index, so morning = 1, afternoon = 2, evening = 3, night = 4
merged_data3$Time <- as.numeric(factor(merged_data3$Time, levels = c("morning", "afternoon", "evening", "night")))


#SEX

merged_data3$Sex <- ifelse(merged_data3$Sex == "M", 0, 
                      ifelse(merged_data3$Sex == "F", 1, 2))


#SPECIES


merged_data3 <- merged_data3 %>%
  mutate(Species = as.numeric(factor(Species)))

# Get the top 5 species
top_species <- merged_data3 %>%
  count(Species, sort = TRUE) %>%
  slice_head(n = 5)

# Create a logical condition to include only the top 5 species
condition <- merged_data3$Species %in% top_species$Species

# Subset the data based on the condition
species_filtered_data <- merged_data3[condition, ]
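The top-5 filter above in miniature (toy species codes), showing the count() + slice_head() + %in% pattern:

```r
library(dplyr)

d <- data.frame(Species = c(1, 1, 2, 2, 2, 3))

# Count occurrences, keep the 2 most frequent species
top <- d %>% count(Species, sort = TRUE) %>% slice_head(n = 2)

# Subset the original rows to those species
d_top <- d[d$Species %in% top$Species, ]
nrow(d_top)  # 5 (species 2 and 1 keep their 3 + 2 rows)
```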



DT::datatable(merged_data3, options = list(pageLength = 10))

4. EDA

4.1 Evolution of shark attacks through the years

The graph below, with the frequency of shark attacks on the y-axis and time on the x-axis, displays a striking pattern: a steady increase in the number of shark attacks over the period considered. This rising trend raises critical questions about its causes and calls for closer examination of the ecological, environmental, and human-related factors that could be involved. Beyond providing a visual depiction of the rise in shark attacks, the graph emphasizes the need to monitor and understand the dynamics of these incidents.

This trend is a key research question: why are shark attacks increasing year after year? The reasons can be various, for example ecological or environmental in nature, or linked to human seaside activity. Understanding the dynamics behind these episodes is exactly what we will do in our study.

Three periods in this graph deserve particular attention: 1975-1978, 1988-1990 and 2019-2020. We could ask ourselves: why is there a decrease in attacks in these periods? For the last period, the answer is simple: our dataset only runs to June 2018, so we have no data for the second half of the year, which explains the drastic drop in attacks. For the other two periods, we will look for the reasons in a more in-depth analysis.


# Create a plot to show the trend of shark attacks throughout the years, including victim's sex (interactive version)
attacks_evolution_sex <- ggplot(data = final_attacks_cleaned, aes(x = Year, fill = Sex)) +
  geom_bar(position = "stack", color = "white") +
  labs(title = "Shark Attacks Evolution Over Years by Sex",
       x = "Year",
       y = "Number of Attacks",
       fill = "Sex") +
  scale_fill_manual(values = c("pink", "blue", "orange"), name = "Sex") +
  theme_minimal()


interactive_attacks_evolution_sex <- ggplotly(attacks_evolution_sex)
interactive_attacks_evolution_sex

4.2 Pattern of shark attacks frequencies

Pattern of shark attacks frequencies through the day

The bar graph below, where the x-axis indicates four time categories —morning, afternoon, evening, and night— displays the distribution of shark attacks throughout the day. The graph’s most noticeable aspect is how frequently shark attacks occur in the afternoon, followed by the morning, evening and night. This concentration of incidents in the afternoon raises intriguing questions about potential human behavioral factors that could contribute to this temporal pattern.

First and foremost, we believe that the afternoon is when most people choose to go swimming, which contributes to the frequency of shark attacks during this time. Recreational water activities such as swimming, surfing, kayaking, and other water sports are probably more popular then. Furthermore, the afternoon is when the seas are warmest, so it is intriguing to find out whether this aspect also has a significant influence on shark activity. Conversely, the smaller number of attacks at night may be related to fewer people being in the water at that time, though other explanations are possible as well (lower temperatures, poorer visibility).

Lastly, we suppose that the medium frequencies of the morning and evening might correspond to transitional periods when shark and human behavior are changing.


bar_colors <- c("#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3")

# Create a sorted table
sorted_table <- table(final_attacks_cleaned$Time)
sorted_table <- sorted_table[order(-sorted_table)]

# Create a bar plot with the sorted data
barplot1 <- barplot(sorted_table,
                    col = bar_colors,
                    xlab = "Time of Day", ylab = "Frequency of Attacks",
                    main = "Frequency of Shark Attacks by Time of Day",
                    border = "white",
                    ylim = c(0, 1800),
                    space = 0.5,
                    cex.names = 0.8,
                    font.axis = 2,
                    beside = TRUE)


# Add text labels with frequencies on the bars
text_labels <- sorted_table
text(x = barplot1, y = sorted_table, labels = text_labels, pos = 3, cex = 0.8, col = "black")

Pattern of shark attacks frequencies per country

A first glance at our plotted data reveals that the United States, Australia, and South Africa are the three primary countries with a significant number of shark attacks.

shark_attacks_by_country <- final_attacks_cleaned %>%
  group_by(Country) %>%
  summarise(total_attacks = n())

shark_attacks_by_country <- shark_attacks_by_country %>%
  mutate(CountryCategory = ifelse(total_attacks >= 30, as.character(Country), "Other"))

# Order the countries by frequency in descending order
shark_attacks_by_country$CountryCategory <- reorder(shark_attacks_by_country$CountryCategory, -shark_attacks_by_country$total_attacks)

# Plot the data
plot2<- ggplot(data = shark_attacks_by_country, aes(x = total_attacks, y = CountryCategory)) +
  geom_col(fill = "skyblue") +
  labs(title = "Total Shark Attacks in Each Country", x = "Number of Attacks", y = "Country") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1)) +
  scale_y_discrete(labels = scales::label_wrap(10))

interactive_plot <- ggplotly(plot2)
interactive_plot

4.3 Sex of the victims

Better understanding of the attacks and the sex of the victims

We initially believed that men experienced attacks more frequently, and our surprise grew when we examined this chart: it highlights a substantial disparity in the distribution of attacks between men and women.

# Create a summary table with the count of attacks for each sex
attacks_summary_sex <- final_attacks_cleaned %>%
  group_by(Sex) %>%
  summarise(NumAttacks = n())

# Define the desired order of levels
sex_order <- c("M", "F", "Unknown")

# Convert the 'Sex' variable to a factor with the specified order
attacks_summary_sex$Sex <- factor(attacks_summary_sex$Sex, levels = sex_order)

# Bar plot to show the distribution of shark attacks by sex
interactive_bar_plot_sex_attacks <- ggplotly(
  ggplot(attacks_summary_sex, aes(x = Sex, y = NumAttacks, fill = Sex)) +
    geom_bar(stat = "identity", position = "dodge", width = 0.7) +
    labs(title = "Distribution of Shark Attacks by Sex",
         x = "Sex",
         y = "Number of Attacks") +
    scale_fill_manual(values = c("blue", "pink", "gray"))
)

# Display the interactive bar plot
interactive_bar_plot_sex_attacks

4.4 Continental distribution of the attacks

What are the continents that have the most attacks?

This plot illustrates the regions with the highest number of attacks since 1970, with North America and Australia & Oceania taking the lead.


#Create a new column in attacks that represents the continents
attacks_grouped <- final_attacks_cleaned %>%
  mutate(Region = case_when(
    ISO_Code %in% c(NA) ~ "Unknown",
    ISO_Code %in% c("AFG", "ARM", "AZE", "BHR", "BGD", "BTN", "BOL", "BGR", "BFA", "BDI", "KHM", "CMR", "CPV", "CHN", "CXR", "CYP", "HKG", "IND", "IDN", "IRN", "IRQ", "ISR", "JOR", "JPN", "KAZ", "KWT", "KGZ", "LAO", "LBN", "MAC", "MYS", "MDV", "MNG", "MMR", "NPL", "PRK", "OMN", "PAK", "PSE", "PHL", "QAT", "SAU", "SGP", "KOR", "LKA", "TUR", "SYR", "TWN", "TJK", "THA", "TKM", "ARE", "UZB", "YEM", "VNM") ~ "Asia",
    ISO_Code %in% c("ALB", "AND", "AUT", "BLR", "BEL", "BIH", "BWA", "HRV", "CZE", "DNK", "EST", "FIN", "FRO", "FRA", "GEO", "DEU", "GRC", "GRL", "GLP", "HUN", "ISL", "IRL", "ITA", "LVA", "LTU", "LIE", "LUX", "MLT", "MDA", "MNE", "NLD", "MKD", "NOR", "POL", "PRT", "ROU", "RUS", "SRB", "SVK", "SVN", "ESP", "SWE", "CHE", "UKR", "GBR") ~ "Europe",
    ISO_Code %in% c("DZA", "AGO", "BEN", "CAF", "TCD", "COM", "COG", "CIV", "COD", "DJI", "TLS", "EGY", "ERI", "SWZ", "ETH", "GNQ", "GAB", "GMB", "GHA", "GIN", "GNB", "KEN", "LSO", "LBR", "LBY", "MDG", "MWI", "MLI", "MTQ", "MRT", "MUS", "MYT", "MAR", "MOZ", "NAM", "NER", "NGA", "REU", "RWA", "SHN", "SEN", "SYC", "SLE", "STP", "ZAF", "SOM", "SSD", "SDN", "SUR", "TZA", "TGO", "TUN", "UGA", "ZMB", "ZWE") ~ "Africa",
    ISO_Code %in% c("AIA", "ATG", "ABW", "BHS", "BMU", "BRB", "BLZ", "VGB", "BRN", "CAN", "CRI", "CUB", "CUW", "DMA", "DOM", "SLV", "GRD", "GTM", "HTI", "HND", "JAM", "MSR", "MEX", "USA", "NIC", "PAN", "KNA", "LCA", "SPM", "VCT", "SXM", "TCA", "TTO") ~ "North America",
    ISO_Code %in% c("ARG", "COL", "CHL", "ECU", "GUF", "GUY", "PRY", "PER", "URY", "VEN", "BRA", "BES") ~ "South America",
    ISO_Code %in% c("AUS", "COK", "FJI", "PYF", "KIR", "MHL", "FSM", "NRU", "NCL", "NZL", "NIU", "PLW", "PNG", "WSM", "SLB", "TON", "TUV", "VUT", "WLF") ~ "Australia & Oceania",
    ISO_Code %in% c("ATA") ~ "Antarctica"
  ))


# Filter data to include only valid regions (excluding "Unknown")
filtered_attacks_region <- attacks_grouped %>% 
  filter(!is.na(Region) & Region != "Unknown")

# Bar plot to show the distribution of shark attacks by region
interactive_bar_plot_region_attacks <- ggplotly(
  ggplot(filtered_attacks_region, aes(x = Region, fill = Region)) +
    geom_bar() +
    labs(title = "Distribution of Shark Attacks by Region",
         x = "Region",
         y = "Number of Attacks") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
)


# Display the interactive bar plot
interactive_bar_plot_region_attacks

4.5 Fatality of shark attacks

Are shark attacks more likely to lead to death?

(A pie chart showing the proportion of fatal vs. non-fatal attacks was displayed here.)

Fatalities by type of shark

As we can see in this graph, most attacks were not fatal, meaning that sharks injured many people without taking their lives. The species responsible for the most incidents were the Bull shark, the Tiger shark and the White shark, with the last being the deadliest. Why do they attack? Probably because of their feeding behavior: they may misidentify their prey, attacking people instead of animals like turtles, seals, or sea lions. This is consistent with the Bull shark and Tiger shark being repeatedly named among the deadliest sharks in the seas and oceans (CBS News, 2010). Another relevant piece of information in this graph is the large number of unidentified sharks: out of all the attacks recorded in our dataset, almost 600 involve an unidentified species. It is a pity not to have this information, which would have led to more insightful results.

# Get the top 10 shark species
top_10_sharks <- head(names(sort(table(final_attacks_cleaned$Species), decreasing = TRUE)), 10)

# Subset the data to include only the top 10 shark species
final_attacks_top_10 <- final_attacks_cleaned[final_attacks_cleaned$Species %in% top_10_sharks, ]

# Create the summary table for the top 10 shark species
summary_table_top_10 <- table(final_attacks_top_10$Species, final_attacks_top_10$Fatality)

# Convert the summary table to a data frame
summary_df_top_10 <- as.data.frame(summary_table_top_10)

# Create the ggplot with the top 10 shark species
plot_top_10 <- ggplot(summary_df_top_10, aes(x = Var1, y = Freq, fill = Var2)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Fatality by Top 10 Types of Shark",
    x = "Type of Shark",
    y = "Frequency",
    fill = "Legend"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(
    name = "Was the shark attack fatal?",
    values = c("blue", "pink", "yellow"),
    labels = c("Yes", "No", "Unknown")
  )

# Display the plot
print(plot_top_10)

## 4.6 Activities related to the attacks

What are the activities with the most attacks?

Identifying the activities most commonly associated with shark attacks was a crucial aspect of our analysis. The results underscore that, among water-related activities, surfing stands out as the predominant one, with 903 attacks recorded since 1970. This is not a surprise: the similarity in silhouette between a surfboard and a seal may lead sharks to mistake one for the other, providing a plausible explanation for the higher incidence of attacks during surfing.


# Check the top activities with the most attacks
top_activities <- final_attacks_cleaned %>%
  group_by(Activity, Sex) %>%
  summarize(Number_of_Attacks = n()) %>%
  arrange(desc(Number_of_Attacks)) %>%
  top_n(30)  # Adjust the number if you want more or fewer top activities

# Plot the results
top_activities_attacks <- ggplot(top_activities, aes(x = reorder(Activity, -Number_of_Attacks), y = Number_of_Attacks, fill = Sex)) +
  geom_bar(stat = "identity") +
  labs(title = "Top Activities with the Most Attacks",
       x = "Activities",
       y = "Number of Attacks",
       fill = "Sex") +
  scale_fill_manual(values = c("pink", "blue", "orange"), name = "Sex") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


interactive_top_activities_attacks <- ggplotly(top_activities_attacks)
interactive_top_activities_attacks

4.7 Monthly distribution of the attacks

What is the time of the year that has the most shark attacks?

As illustrated in this graph, shark attacks tend to occur more frequently during the summer months than during the rest of the year. This seems logical, as there are more people in the water during summer.



# Create a plot of shark attacks per month by sex.
# To display the months in calendar order rather than alphabetically, we
# convert them to a factor with explicit levels (assuming the Date column
# stores full month names).
attacks_per_months <- ggplot(data = final_attacks_cleaned, aes(x = factor(Date, levels = month.name), fill = Sex)) +
  geom_bar(position = "stack", color = "white") +
  labs(title = "Shark Attacks Repartition by Month and Sex",
       x = "Month",
       y = "Number of Attacks",
       fill = "Sex") +
  scale_fill_manual(values = c("pink", "blue", "orange"), name = "Sex") +
  theme_minimal()


interactive_attacks_per_months <- ggplotly(attacks_per_months)
interactive_attacks_per_months

4.8 Evolution of temperatures through the years

The graph below illustrates the change in global temperature throughout the last few decades, showing an overall increase over the 50-year period. Despite some fluctuations, the upward trend indicates a clear warming pattern, with a notable acceleration after the 1980s and one of the most pronounced rises occurring in the most recent decade.

# Keep only rows whose Temperature value can be parsed as a number
# (as.numeric() returns NA, with a warning, for non-numeric strings)
temperature_data_filtered <- final_temperature_cleaned %>%
  filter(!is.na(Temperature), !is.na(suppressWarnings(as.numeric(Temperature))))

# Convert the Temperature column to numeric
temperature_data_filtered$Temperature <- as.numeric(temperature_data_filtered$Temperature)

# Calculate the mean temperature for each year
world_temperature <- temperature_data_filtered %>%
  group_by(Year) %>%
  summarise(World_Temperature = mean(Temperature, na.rm = TRUE))

# Create a line plot for the world temperature
plot5 <- ggplot(world_temperature, aes(x = Year, y = World_Temperature)) +
  geom_line() +
  labs(title = "World Temperature Change Over Years",
       x = "Year",
       y = "World Temperature Change") +
  theme_minimal()

interactive_plot <- ggplotly(plot5)
interactive_plot

4.9 Evolution of sea level through the years

This line graph depicts the trend of rising sea levels from around the ’90s until the early 2020s, showing some fluctuations initially and a notable upward trend thereafter. The early 2000s show a steady rate of increase, with a minor dip in the early 2010s. This pattern is broadly consistent with global temperature changes, and the steepening rise of the last couple of years indicates a complementary relationship with rising temperature levels.

# Create a line plot
plot6 <- ggplot(data = final_sealevel_cleaned, aes(x = Year, y = GMSL_GIA)) +
  geom_line(color = "green", size = 1.5) +
  labs(title = "Evolution of Sea Level Over the Years", x = "Year", y = "Sea Level")

interactive_plot <- ggplotly(plot6)
interactive_plot

4.10 CO2 emissions insights

Trend of CO2 emissions per year

The graph reveals an evident increase of emissions in the depicted period, as indicated by a consistent upward trend with few fluctuations and no significant declines. The rise reflects the growth of industrialisation over the years, especially marked by the end of the Cold War and the subsequent economic growth in many parts of the world. The early 2000s show a particularly sharp increase, likely a result of the rapid expansion of manufacturing and energy production in Asian regions. The post-2010 years show some fluctuations that could be attributed to emissions mitigation efforts, but the overall trend remains upward.

# Create a line plot showing the trend in CO2 emissions over the years
# (backticks are needed because the column name contains spaces)
plot7 <- ggplot(data = final_ghg_cleaned, aes(x = Year, y = `Annual CO2 Emissions`)) +
  geom_line(color = "blue") +
  labs(title = "Trend of CO2 Emissions Over the Years", x = "Year", y = "Annual CO2 Emissions")

interactive_plot <- ggplotly(plot7)
interactive_plot

Which continent generates the most GHG emissions?

We can see that there is a significant disparity in emissions among the different continents; Asia shows a sharp increase in emissions over the years, particularly accelerating after 2000, becoming the continent with the highest CO2 emissions. This is consistent with other data showing that 5 out of the 6 most polluting countries in the world are in Asia, namely China, India, Japan, Russia and Iran (Statista, 2021). Emissions in Europe depict a gradual rise until the early ’90s, with noticeable fluctuation and decline thereafter. The North American region follows a similar trend to Europe, with a steady increase until the 2000s followed by a levelling off and a decline. On the other hand, CO2 emissions from South America show a steady increase over time with fewer fluctuations than Europe and North America. Africa’s emissions have also been increasing at a steady rate over the last few decades, but they remain significantly lower than those of the other continents. Oceania has the lowest overall emissions among the listed regions, with a rate of increase on par with Africa. Antarctica’s emissions are not visible on the graph, which is consistent with expectations as it is a largely uninhabited continent.


co2_grouped <- final_ghg_cleaned %>%
  mutate(Region = case_when(
    ISO_Code %in% c(NA) ~ "Unknown",
    ISO_Code %in% c("AFG", "ARM", "AZE", "BHR", "BGD", "BTN", "BRN", "KHM", "CHN", "CXR", "CYP", "HKG", "IND", "IDN", "IRN", "IRQ", "ISR", "JOR", "JPN", "KAZ", "KWT", "KGZ", "LAO", "LBN", "MAC", "MYS", "MDV", "MNG", "MMR", "NPL", "PRK", "OMN", "PAK", "PSE", "PHL", "QAT", "SAU", "SGP", "KOR", "LKA", "TUR", "SYR", "TWN", "TJK", "THA", "TLS", "TKM", "ARE", "UZB", "YEM", "VNM") ~ "Asia",
    ISO_Code %in% c("ALB", "AND", "AUT", "BLR", "BEL", "BIH", "BGR", "HRV", "CZE", "DNK", "EST", "FIN", "FRO", "FRA", "GEO", "DEU", "GRC", "GRL", "GLP", "HUN", "ISL", "IRL", "ITA", "LVA", "LTU", "LIE", "LUX", "MLT", "MDA", "MNE", "NLD", "MKD", "NOR", "POL", "PRT", "ROU", "RUS", "SRB", "SVK", "SVN", "ESP", "SWE", "CHE", "UKR", "GBR") ~ "Europe",
    ISO_Code %in% c("DZA", "AGO", "BEN", "BWA", "BFA", "BDI", "CMR", "CPV", "CAF", "TCD", "COM", "COG", "CIV", "COD", "DJI", "EGY", "ERI", "SWZ", "ETH", "GNQ", "GAB", "GMB", "GHA", "GIN", "GNB", "KEN", "LSO", "LBR", "LBY", "MDG", "MWI", "MLI", "MRT", "MUS", "MYT", "MAR", "MOZ", "NAM", "NER", "NGA", "REU", "RWA", "SHN", "SEN", "SYC", "SLE", "STP", "ZAF", "SOM", "SSD", "SDN", "TZA", "TGO", "TUN", "UGA", "ZMB", "ZWE") ~ "Africa",
    ISO_Code %in% c("AIA", "ATG", "ABW", "BHS", "BMU", "BRB", "BLZ", "VGB", "CAN", "CRI", "CUB", "CUW", "DMA", "DOM", "SLV", "GRD", "GTM", "HTI", "HND", "JAM", "MSR", "MTQ", "MEX", "USA", "NIC", "PAN", "KNA", "LCA", "SPM", "VCT", "SXM", "TCA", "TTO") ~ "North America",
    ISO_Code %in% c("ARG", "BOL", "COL", "CHL", "ECU", "GUF", "GUY", "PRY", "PER", "SUR", "URY", "VEN", "BRA", "BES") ~ "South America",
    ISO_Code %in% c("AUS", "COK", "FJI", "PYF", "KIR", "MHL", "FSM", "NRU", "NCL", "NZL", "NIU", "PLW", "PNG", "WSM", "SLB", "TON", "TUV", "VUT", "WLF") ~ "Australia & Oceania",
    ISO_Code %in% c("ATA") ~ "Antarctica"
  ))


# Calculate mean emissions by Year and Region, dropping observations with a
# missing ISO code or an unknown region so the continents are easier to compare
mean_emissions <- co2_grouped %>%
  filter(!is.na(Region) & !is.na(ISO_Code) & Region != "Unknown") %>%
  group_by(Year, Region) %>%
  summarize(Mean_Emissions = mean(`Annual CO2 Emissions`, na.rm = TRUE), .groups = "drop")

# Plot the data
plot2 <- ggplot(mean_emissions, aes(x = Year, y = Mean_Emissions, color = Region)) +
  geom_line() +
  labs(title = "CO2 Emissions by Continent",
       x = "Year",
       y = "Annual CO2 Emissions") +
  theme_minimal()

interactive_plot <- ggplotly(plot2)
interactive_plot

What are the countries with the most CO2 emissions?

Here is a plot to further illustrate the distribution of CO2 emissions in the world: the top 30 countries with the highest CO2 emissions since 1970. Interestingly, most of them are European countries. Unsurprisingly, however, the world’s two biggest emitters are in North America and Asia.
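The code for this plot is not shown above; here is a minimal sketch of how it could be produced. It assumes that `final_ghg_cleaned` has an `Entity` column holding country names alongside the `Annual CO2 Emissions` column used earlier (the column names are assumptions based on the rest of this report).

```r
library(dplyr)
library(ggplot2)
library(plotly)

# Top 30 countries by total CO2 emissions since 1970
# (sketch; assumes an Entity column with country names)
top_emitters <- final_ghg_cleaned %>%
  filter(Year >= 1970, !is.na(ISO_Code)) %>%
  group_by(Entity) %>%
  summarize(Total_Emissions = sum(`Annual CO2 Emissions`, na.rm = TRUE)) %>%
  slice_max(Total_Emissions, n = 30)

plot_top_emitters <- ggplot(top_emitters,
                            aes(x = reorder(Entity, -Total_Emissions),
                                y = Total_Emissions)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Top 30 Countries by CO2 Emissions",
       x = "Country", y = "Total CO2 Emissions") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(plot_top_emitters)
```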

## 4.11 Relationships plots between our variables

Now that we have a better understanding of our variables, we can analyse their relationships.

What is the relationship between the attacks and the age of the victims?

There is a negative correlation between the age of the victim and the frequency of attacks. This can be attributed to the fact that teenagers and adults under 40 spend more time in the water than older individuals, and are therefore more exposed through water-related activities.


# Create a summary table with the count of attacks for each age
attacks_summary <- final_attacks_cleaned %>%
  group_by(Age) %>%
  summarise(NumAttacks = n())

# Scatter plot to show the relationship between the number of shark attacks and age
interactive_attacks_age <- ggplotly(
  ggplot(attacks_summary, aes(x = Age, y = NumAttacks)) +
    geom_point(color = "blue", size = 2) +
    geom_smooth(method = "lm", se = FALSE, color = "red") +  # Add a linear trend line
    labs(title = "Relationship between Shark Attacks and Age",
         x = "Age",
         y = "Number of Attacks")
)

interactive_attacks_age

Is there a relationship between sea level and the number of shark attacks?

We can discern a positive correlation between the increase in sea levels and the occurrence of shark attacks. This observation is particularly intriguing since rising sea levels are commonly understood to result from global warming. Consequently, it raises the possibility that human influence may be a more significant factor in the surge of shark attacks than previously acknowledged. Note, however, that both series trend upward over time, so part of this correlation may simply reflect their common time trend rather than a direct relationship.

# Calculate the number of attacks per year
attacks_summary <- final_attacks_cleaned %>%
  group_by(Year) %>%
  summarize(Number_of_Attacks = n())

# Merge the attack counts with the sea-level data on the shared Year column
merged_data5 <- merge(attacks_summary, sealevel, by = "Year", all.x = TRUE)

# Correlation between the number of attacks and sea level
correlation_coefficient <- cor(merged_data5$Number_of_Attacks, merged_data5$GMSL_GIA, use = "complete.obs")
correlation_coefficient

# Scatter plot
interactive_scatter_plot_sea_level_attacks <- ggplotly(
  ggplot(merged_data5, aes(x = GMSL_GIA, y = Number_of_Attacks)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "red") +  # Add a linear trend line
    labs(title = "Relationship Between Sea Level and Shark Attacks",
         x = "Global Mean Sea Level Global Isostatic Adjustment (GMSL_GIA)",
         y = "Number of Attacks")
)

# Display the interactive scatter plot
interactive_scatter_plot_sea_level_attacks

What is the relationship between temperature change and the number of shark attacks?

In addition to the previous plot, it is evident that there is a positive correlation between global temperatures and the frequency of shark attacks. This further strengthens the earlier argument, suggesting potentially a stronger human responsibility.


# Convert the Temperature column to numeric; entries that cannot be parsed
# become NA. (Note: is.numeric() tests the whole column at once, so it cannot
# be used inside filter() to drop individual non-numeric values.)
final_temperature_cleaned$Temperature <- suppressWarnings(as.numeric(final_temperature_cleaned$Temperature))

# Calculate the mean temperature change for the world per year
mean_temp_world <- final_temperature_cleaned %>%
  filter(!is.na(Temperature)) %>%
  group_by(Year) %>%
  summarize(Mean_Temperature = mean(Temperature, na.rm = TRUE))

# Merge datasets
merged_data6 <- merge(attacks_summary, mean_temp_world, by = "Year", all.x = TRUE)

# Plotting
interactive_scatter_plot_worldtemperature_attacks <- ggplotly(
  ggplot(merged_data6, aes(x = Mean_Temperature, y = Number_of_Attacks)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "red") +  # Add a linear trend line
    labs(title = "Relationship between Shark Attacks and World Mean Temperature Change",
         x = "World Mean Temperature Change",
         y = "Number of Shark Attacks") +
    theme_minimal()
)


interactive_scatter_plot_worldtemperature_attacks

Interactive map

#> # A tibble: 69 x 2
#>    Country    Attackscountry
#>    <chr>               <int>
#>  1 ARUBA                   1
#>  2 AUSTRALIA             520
#>  3 BAHAMAS                77
#>  4 BRAZIL                 96
#>  5 CANADA                  1
#>  6 CHILE                   1
#>  7 CHINA                   4
#>  8 COLOMBIA                3
#>  9 COSTA RICA              5
#> 10 CROATIA                 3
#> # ℹ 59 more rows
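The country counts shown above can be reproduced with a short pipeline. This is a sketch, assuming `final_attacks_cleaned` contains a `Country` column:

```r
library(dplyr)

# Number of recorded attacks per country (sketch)
Attackspercountry <- final_attacks_cleaned %>%
  group_by(Country) %>%
  summarize(Attackscountry = n()) %>%
  arrange(Country)

Attackspercountry
```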

5. Analysis

5.1 RQ1: What are the primary factors influencing the occurrence of shark attacks?

After a thorough exploratory data analysis, we considered several factors that might directly influence the occurrence of shark attacks; directly, in the sense that these variables come from the main dataset rather than from a secondary one. Let’s begin:

Before running any regression, we compute the correlation matrix in order to analyze our numerical variables.
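A sketch of how this matrix can be computed, assuming the six variables were label-encoded as numbers during cleaning (the encoding itself is an assumption based on the output below):

```r
library(dplyr)

# Correlation matrix of the numerically encoded variables
numeric_vars <- final_attacks_cleaned %>%
  select(Year, Type, Sex, Age, Time, Species) %>%
  mutate(across(everything(), as.numeric))

round(cor(numeric_vars, use = "complete.obs"), 4)
```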

#>            Year    Type     Sex      Age    Time  Species
#> Year     1.0000  0.0280 -0.0546  0.09953 -0.0239 -0.07148
#> Type     0.0280  1.0000 -0.1276  0.03761  0.0650 -0.10931
#> Sex     -0.0546 -0.1276  1.0000 -0.04967  0.0134  0.01406
#> Age      0.0995  0.0376 -0.0497  1.00000 -0.1298 -0.00215
#> Time    -0.0239  0.0650  0.0134 -0.12980  1.0000 -0.01346
#> Species -0.0715 -0.1093  0.0141 -0.00215 -0.0135  1.00000
  • From this correlation matrix, we can observe that:
      o Year has a slight positive correlation (0.10) with Age, possibly because the population engaging in water activities such as surfing or swimming has aged over time, and a slight negative correlation (-0.07) with Species.
      o Age has a slight negative correlation (-0.13) with Time.
      o Type has slight negative correlations with Sex (-0.13) and Species (-0.11); its correlations with the remaining variables are close to zero.
      o All other coefficients are close to 0, so no pair of numerical variables is strongly correlated; in particular, Sex shows no meaningful correlation with any other variable.

The regression that we will use is the following:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \epsilon \]

where:

  1. \(Y\) = Frequency of Shark Attacks
  2. \(X_1\) = Year
  3. \(X_2\) = Sex
  4. \(X_3\) = Age
  5. \(X_4\) = Species
  6. \(X_5\) = Type
  7. \(X_6\) = Time
  8. \(\epsilon\) = Residual Error

Let’s run our main regression on all the countries.
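The model can be fitted as follows. This is a sketch: `merged_map` is assumed to combine the per-country attack counts (`Attackscountry`) with the observation-level predictors, as the output below suggests.

```r
# Fit the main regression on all countries (sketch)
main_model <- lm(Attackscountry ~ Year + Sex + Age + Species + Type + Time,
                 data = merged_map)
summary(main_model)
```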

#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type + 
#>     Time, data = merged_map)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -1206   -561    349    530   1221 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -6113.877   1764.321   -3.47  0.00054 ***
#> Year            3.721      0.879    4.23  2.4e-05 ***
#> Sex           -33.696     20.080   -1.68  0.09343 .  
#> Age            -9.595      0.857  -11.20  < 2e-16 ***
#> Species        -2.015      1.330   -1.52  0.12985    
#> Type          -45.472     15.136   -3.00  0.00269 ** 
#> Time           -1.139     16.719   -0.07  0.94569    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 603 on 2894 degrees of freedom
#> Multiple R-squared:  0.0509, Adjusted R-squared:  0.0489 
#> F-statistic: 25.9 on 6 and 2894 DF,  p-value: <2e-16
Regression on the main factors influencing shark attacks

Dependent variable: Attackscountry

  Year          3.720***      (0.879)
  Sex         -33.700*       (20.100)
  Age          -9.600***      (0.857)
  Species      -2.020         (1.330)
  Type        -45.500***     (15.100)
  Time         -1.140        (16.700)
  Constant -6,114.000***  (1,764.000)

  Observations          2,901
  R2                    0.051
  Adjusted R2           0.049
  Residual Std. Error   603.000 (df = 2894)
  F Statistic           25.900*** (df = 6; 2894)

Note: *p<0.1; **p<0.05; ***p<0.01
  • We ran a linear regression with Attackscountry as the response variable, predicted by Year, Sex, Age, Time, Type and Species.
  • The intercept is the predicted value of Attackscountry when all predictors are zero, which is not directly meaningful here (Age, for instance, cannot be zero).
  • The variable Year is statistically significant (p-value < 0.001), indicating a significant relationship between Year and the number of shark attacks. Age is also significant at the 0.1% level, and Type at the 1% level.
  • Sex is only marginally significant (at the 10% level), while Species and Time are not statistically significant.
  • The R-squared of this model is 5.1%: only 5.1% of the variance in the response variable (Attackscountry) is explained by the model. The adjusted R-squared is similar, at 4.9%. The model therefore does not explain a large portion of the variability.
  • The F-statistic has a very low p-value, suggesting that at least one predictor variable is significantly related to the response variable.

This first regression will be based only on the USA
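A sketch of the USA subset and fit (assuming `merged_map` has a `Country` column; the object names are assumptions):

```r
library(dplyr)

# Keep only the USA observations and refit the same specification
usa_data <- merged_map %>%
  filter(Country == "USA")

usa_model <- lm(Attackscountry ~ Year + Sex + Age + Species + Type + Time,
                data = usa_data)
summary(usa_model)
```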

#> [1] "Regression for USA:"
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type + 
#>     Time, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014306 -0.0000000000001  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000041 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001682565    0.0000000001644319
#> Year          -0.0000000000001029    0.0000000000000820
#> Sex            0.0000000000012610    0.0000000000019501
#> Age            0.0000000000000542    0.0000000000000701
#> Species       -0.0000000000000567    0.0000000000001135
#> Type           0.0000000000001034    0.0000000000016152
#> Time          -0.0000000000003080    0.0000000000014886
#>                      t value            Pr(>|t|)    
#> (Intercept) 8994604698864.05 <0.0000000000000002 ***
#> Year                   -1.26                0.21    
#> Sex                     0.65                0.52    
#> Age                     0.77                0.44    
#> Species                -0.50                0.62    
#> Type                    0.06                0.95    
#> Time                   -0.21                0.84    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1472 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.498 
#> F-statistic:  245 on 6 and 1472 DF,  p-value: <0.0000000000000002
#>    Year     Sex     Age Species    Type    Time 
#>    1.03    1.02    1.03    1.03    1.05    1.01
Regression on the main factors influencing shark attacks focused on the USA
Dependent variable:
Attackscountry
Year -0.000
(0.000)
Sex 0.000
(0.000)
Age 0.000
(0.000)
Species -0.000
(0.000)
Type 0.000
(0.000)
Time -0.000
(0.000)
Constant 1,479.000***
(0.000)
Observations 1,479
R2 0.500
Adjusted R2 0.498
Residual Std. Error 0.000 (df = 1472)
F Statistic 245.000*** (df = 6; 1472)
Note: p<0.1; p<0.05; p<0.01
  • We focus on the country with the most shark attacks in the world according to our main dataset.
  • None of the predictor variables are statistically significant in predicting the number of shark attacks in the USA.
  • Although the high R-squared (0.5) and the tiny residuals might suggest a good fit, they are misleading: Attackscountry takes the single value 1,479 for every USA observation, so the response has essentially no variability to explain and the coefficient estimates are numerical noise. Could the apparent fit instead hide multicollinearity?
  • According to the VIF analysis, the values are low. Since they are all below 5, and indeed close to 1, there is no sign of high multicollinearity; each variable contributes relatively independently to explaining the variability of Attackscountry.
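The VIF values quoted above can be obtained with the `car` package. This is a sketch; `usa_data` is the assumed name of the USA subset used for the regression:

```r
# Variance inflation factors for the USA regression (sketch)
library(car)
vif(lm(Attackscountry ~ Year + Sex + Age + Species + Type + Time,
       data = usa_data))
```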
#> Variable added: Year  | AIC: -66827 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014319  0.0000000000003  0.0000000000012  0.0000000000019 
#>              Max 
#>  0.0000000000025 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001439275    0.0000000001614318
#> Year          -0.0000000000000911    0.0000000000000807
#>                      t value            Pr(>|t|)    
#> (Intercept) 9161765526064.88 <0.0000000000000002 ***
#> Year                   -1.13                0.26    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1477 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:   0.5 
#> F-statistic: 1.48e+03 on 1 and 1477 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Sex  | AIC: -66826 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014315  0.0000000000002  0.0000000000012  0.0000000000019 
#>              Max 
#>  0.0000000000028 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001498393    0.0000000001618005
#> Year          -0.0000000000000943    0.0000000000000809
#> Sex            0.0000000000011236    0.0000000000019330
#>                      t value            Pr(>|t|)    
#> (Intercept) 9140887108111.84 <0.0000000000000002 ***
#> Year                   -1.17                0.24    
#> Sex                     0.58                0.56    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1476 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.499 
#> F-statistic:  738 on 2 and 1476 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Age  | AIC: -66824 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014309  0.0000000000000  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000038 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001623448    0.0000000001625604
#> Year          -0.0000000000001011    0.0000000000000813
#> Sex            0.0000000000012543    0.0000000000019402
#> Age            0.0000000000000553    0.0000000000000695
#>                      t value            Pr(>|t|)    
#> (Intercept) 9098156172803.24 <0.0000000000000002 ***
#> Year                   -1.24                0.21    
#> Sex                     0.65                0.52    
#> Age                     0.80                0.43    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1475 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.499 
#> F-statistic:  492 on 3 and 1475 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Species  | AIC: -66823 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014307 -0.0000000000001  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000041 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001696208    0.0000000001632457
#> Year          -0.0000000000001039    0.0000000000000815
#> Sex            0.0000000000012481    0.0000000000019407
#> Age            0.0000000000000558    0.0000000000000696
#> Species       -0.0000000000000577    0.0000000000001121
#>                      t value            Pr(>|t|)    
#> (Intercept) 9059961123428.15 <0.0000000000000002 ***
#> Year                   -1.27                0.20    
#> Sex                     0.64                0.52    
#> Age                     0.80                0.42    
#> Species                -0.51                0.61    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1474 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.499 
#> F-statistic:  368 on 4 and 1474 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Time  | AIC: -66821 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time, 
#>     data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014306 -0.0000000000001  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000041 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001693934    0.0000000001633086
#> Year          -0.0000000000001034    0.0000000000000816
#> Sex            0.0000000000012497    0.0000000000019414
#> Age            0.0000000000000545    0.0000000000000699
#> Species       -0.0000000000000578    0.0000000000001122
#> Time          -0.0000000000003013    0.0000000000014843
#>                      t value            Pr(>|t|)    
#> (Intercept) 9056472162651.47 <0.0000000000000002 ***
#> Year                   -1.27                0.21    
#> Sex                     0.64                0.52    
#> Age                     0.78                0.44    
#> Species                -0.52                0.61    
#> Time                   -0.20                0.84    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1473 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.498 
#> F-statistic:  295 on 5 and 1473 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Type  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time + 
#>     Type, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014306 -0.0000000000001  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000041 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001682565    0.0000000001644319
#> Year          -0.0000000000001029    0.0000000000000820
#> Sex            0.0000000000012610    0.0000000000019501
#> Age            0.0000000000000542    0.0000000000000701
#> Species       -0.0000000000000567    0.0000000000001135
#> Time          -0.0000000000003080    0.0000000000014886
#> Type           0.0000000000001034    0.0000000000016152
#>                      t value            Pr(>|t|)    
#> (Intercept) 8994604698864.05 <0.0000000000000002 ***
#> Year                   -1.26                0.21    
#> Sex                     0.65                0.52    
#> Age                     0.77                0.44    
#> Species                -0.50                0.62    
#> Time                   -0.21                0.84    
#> Type                    0.06                0.95    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1472 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.498 
#> F-statistic:  245 on 6 and 1472 DF,  p-value: <0.0000000000000002

To obtain a better model, we perform a forward selection and keep the model with the lowest AIC.

Here are the AIC values for each model:

  1. Model with Year only: AIC = -66827.49
  2. Model with Year and Sex: AIC = -66825.83
  3. Model with Year, Sex, and Age: AIC = -66824.46
  4. Model with Year, Sex, Age, and Species: AIC = -66822.72
  5. Model with Year, Sex, Age, Species and Time: AIC = -66820.77
  6. Model with Year, Sex, Age, Species, Time and Type: AIC = -66818.77

Strictly by AIC, the model with Year only has the lowest value and would therefore be preferred, as a lower AIC indicates a better trade-off between fit and complexity. We nonetheless retained the full model because it includes all the candidate variables, keeping in mind that the AIC differences between these models are very small.
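Forward selection by AIC can also be automated with `step()`; this sketch starts from an intercept-only model and adds predictors only while the AIC improves (note that the manual procedure printed above instead adds variables in a fixed order). The object name `usa_data` is an assumption.

```r
# Automated forward selection by AIC (sketch)
null_model <- lm(Attackscountry ~ 1, data = usa_data)
forward_model <- step(null_model,
                      scope = ~ Year + Sex + Age + Species + Time + Type,
                      direction = "forward", trace = TRUE)
summary(forward_model)
```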

This second regression will be based only on Australia

#> [1] "Regression for AUSTRALIA:"
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type + 
#>     Time, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012332 -0.00000000000012  0.00000000000024 
#>                3Q               Max 
#>  0.00000000000050  0.00000000000150 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 519.99999999999772626   0.00000000003777441
#> Year         -0.00000000000000106   0.00000000000001876
#> Sex           0.00000000000018077   0.00000000000041286
#> Age          -0.00000000000000712   0.00000000000002074
#> Species      -0.00000000000000743   0.00000000000003222
#> Type          0.00000000000012036   0.00000000000029065
#> Time         -0.00000000000062736   0.00000000000034340
#>                       t value            Pr(>|t|)    
#> (Intercept) 13765932943737.86 <0.0000000000000002 ***
#> Year                    -0.06               0.955    
#> Sex                      0.44               0.662    
#> Age                     -0.34               0.732    
#> Species                 -0.23               0.818    
#> Type                     0.41               0.679    
#> Time                    -1.83               0.068 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000547 on 513 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.494 
#> F-statistic: 85.5 on 6 and 513 DF,  p-value: <0.0000000000000002
Regression on the main factors influencing shark attacks focused on AUSTRALIA
Dependent variable:
Attackscountry
Year -0.000
(0.000)
Sex 0.000
(0.000)
Age -0.000
(0.000)
Species -0.000
(0.000)
Type 0.000
(0.000)
Time -0.000*
(0.000)
Constant 520.000***
(0.000)
Observations 520
R2 0.500
Adjusted R2 0.494
Residual Std. Error 0.000 (df = 513)
F Statistic 85.500*** (df = 6; 513)
Note: p<0.1; p<0.05; p<0.01
  • We focus on the country with the second-most shark attacks in the world according to our main dataset.
  • None of these variables are statistically significant. As with the USA, the very small residual standard error and the R-squared of about 50% are misleading: Attackscountry equals 520 for every Australian observation, so the model has essentially no variability to explain.
#> Variable added: Year  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012420  0.00000000000023  0.00000000000025 
#>                3Q               Max 
#>  0.00000000000026  0.00000000000027 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 519.99999999999772626   0.00000000003604512
#> Year         -0.00000000000000173   0.00000000000001800
#>                      t value            Pr(>|t|)    
#> (Intercept) 14426365221367.6 <0.0000000000000002 ***
#> Year                    -0.1                0.92    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000546 on 518 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.499 
#> F-statistic:  517 on 1 and 518 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Sex  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012415  0.00000000000027  0.00000000000029 
#>                3Q               Max 
#>  0.00000000000029  0.00000000000030 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999995338840   0.000000000036493311
#> Year         -0.000000000000000546   0.000000000000018213
#> Sex           0.000000000000176122   0.000000000000402815
#>                       t value            Pr(>|t|)    
#> (Intercept) 14249186641170.37 <0.0000000000000002 ***
#> Year                    -0.03                0.98    
#> Sex                      0.44                0.66    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000547 on 517 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.498 
#> F-statistic:  259 on 2 and 517 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Age  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012415  0.00000000000027  0.00000000000029 
#>                3Q               Max 
#>  0.00000000000029  0.00000000000030 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999995338840   0.000000000036790730
#> Year         -0.000000000000000566   0.000000000000018402
#> Sex           0.000000000000176164   0.000000000000403241
#> Age           0.000000000000000161   0.000000000000020290
#>                       t value            Pr(>|t|)    
#> (Intercept) 14133995074177.91 <0.0000000000000002 ***
#> Year                    -0.03                0.98    
#> Sex                      0.44                0.66    
#> Age                      0.01                0.99    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000547 on 516 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.497 
#> F-statistic:  172 on 3 and 516 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Species  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012414  0.00000000000021  0.00000000000030 
#>                3Q               Max 
#>  0.00000000000031  0.00000000000034 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999996248334   0.000000000037075547
#> Year         -0.000000000000000868   0.000000000000018490
#> Sex           0.000000000000180007   0.000000000000404139
#> Age           0.000000000000000389   0.000000000000020345
#> Species      -0.000000000000005997   0.000000000000032010
#>                       t value            Pr(>|t|)    
#> (Intercept) 14025416977620.41 <0.0000000000000002 ***
#> Year                    -0.05                0.96    
#> Sex                      0.45                0.66    
#> Age                      0.02                0.98    
#> Species                 -0.19                0.85    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000548 on 515 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.496 
#> F-statistic:  129 on 4 and 515 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Time  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time, 
#>     data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012336 -0.00000000000014  0.00000000000025 
#>                3Q               Max 
#>  0.00000000000049  0.00000000000171 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 520.00000000000056843   0.00000000003707596
#> Year         -0.00000000000000238   0.00000000000001847
#> Sex           0.00000000000014561   0.00000000000040371
#> Age          -0.00000000000000639   0.00000000000002065
#> Species      -0.00000000000000896   0.00000000000003198
#> Time         -0.00000000000061699   0.00000000000034221
#>                       t value            Pr(>|t|)    
#> (Intercept) 14025261272954.00 <0.0000000000000002 ***
#> Year                    -0.13               0.897    
#> Sex                      0.36               0.718    
#> Age                     -0.31               0.757    
#> Species                 -0.28               0.780    
#> Time                    -1.80               0.072 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000546 on 514 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.495 
#> F-statistic:  103 on 5 and 514 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Type  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time + 
#>     Type, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012332 -0.00000000000012  0.00000000000024 
#>                3Q               Max 
#>  0.00000000000050  0.00000000000150 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 519.99999999999772626   0.00000000003777441
#> Year         -0.00000000000000106   0.00000000000001876
#> Sex           0.00000000000018077   0.00000000000041286
#> Age          -0.00000000000000712   0.00000000000002074
#> Species      -0.00000000000000743   0.00000000000003222
#> Time         -0.00000000000062736   0.00000000000034340
#> Type          0.00000000000012036   0.00000000000029065
#>                       t value            Pr(>|t|)    
#> (Intercept) 13765932943737.87 <0.0000000000000002 ***
#> Year                    -0.06               0.955    
#> Sex                      0.44               0.662    
#> Age                     -0.34               0.732    
#> Species                 -0.23               0.818    
#> Time                    -1.83               0.068 .  
#> Type                     0.41               0.679    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000547 on 513 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.494 
#> F-statistic: 85.5 on 6 and 513 DF,  p-value: <0.0000000000000002
#>    Year     Sex     Age Species    Type    Time 
#>    1.08    1.07    1.07    1.03    1.09    1.05
  • In this case, every model has the same AIC, with a value of -66818.77: adding more variables does not improve the fit. Since the AIC is identical across specifications, we keep the model with the most variables.
  • The full model nominally explains almost 50% of the variability, and the VIF values above (all close to 1) indicate no problematic multicollinearity among the predictors. As with the coefficient estimates, however, the near-zero residuals suggest this R² reflects the structure of the dependent variable rather than strong predictive power.
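The VIF values reported above can be reproduced by hand: for each predictor, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing that predictor on all the others. A minimal sketch on simulated data (not the project's dataset; the variable names are illustrative):

```r
# Manual VIF: regress each predictor on the remaining ones and invert 1 - R^2.
set.seed(42)
n <- 520
sim <- data.frame(Year = rnorm(n), Sex = rnorm(n), Age = rnorm(n))

vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

print(round(vif_manual(sim), 3))  # independent simulated predictors -> VIFs near 1
```

Values near 1, as in the output above, indicate that the predictors carry little redundant information.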

Let’s focus on both countries

#> [1] "Regression for USA and AUSTRALIA:"
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type + 
#>     Time, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -934   -503    192    249    605 
#> 
#> Coefficients:
#>             Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 3829.068   1513.279    2.53              0.01147 *  
#> Year          -1.135      0.754   -1.51              0.13221    
#> Sex          -23.663     17.498   -1.35              0.17644    
#> Age           -5.800      0.678   -8.56 < 0.0000000000000002 ***
#> Species       -4.155      1.104   -3.76              0.00017 ***
#> Type         -28.026     13.714   -2.04              0.04111 *  
#> Time          17.633     13.784    1.28              0.20095    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 411 on 1992 degrees of freedom
#> Multiple R-squared:  0.0503, Adjusted R-squared:  0.0474 
#> F-statistic: 17.6 on 6 and 1992 DF,  p-value: <0.0000000000000002
Regression on the main factors influencing shark attacks focused on the USA and AUSTRALIA

                          Dependent variable:
                          Attackscountry
    -----------------------------------------------
    Year                      -1.140     (0.754)
    Sex                      -23.700    (17.500)
    Age                       -5.800***  (0.678)
    Species                   -4.160***  (1.100)
    Type                     -28.000**  (13.700)
    Time                      17.600    (13.800)
    Constant               3,829.000** (1,513.000)
    -----------------------------------------------
    Observations             1,999
    R2                       0.050
    Adjusted R2              0.047
    Residual Std. Error      411.000 (df = 1992)
    F Statistic              17.600*** (df = 6; 1992)
    -----------------------------------------------
    Note: *p<0.1; **p<0.05; ***p<0.01
  • We focus on the two countries with the most shark attacks in the world according to our main dataset.
  • Year and Sex are not statistically significant in this model. Among the others, Age and Species are significant at the 0.1% level and Type at the 5% level, while Time (p ≈ 0.20) does not reach significance.
  • The model explains only about 5% of the variability of Attackscountry (R² = 0.0503). We conclude that a model containing more informative variables would explain the data better.
#> Variable added: Year  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -759   -684    244    259    275 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  4354.25    1520.51    2.86   0.0042 **
#> Year           -1.56       0.76   -2.06   0.0400 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 420 on 1997 degrees of freedom
#> Multiple R-squared:  0.00211,    Adjusted R-squared:  0.00161 
#> F-statistic: 4.22 on 1 and 1997 DF,  p-value: 0.04
#> 
#> Variable added: Sex  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -762   -682    242    261    291 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  4357.61    1520.77    2.87   0.0042 **
#> Year           -1.56       0.76   -2.06   0.0400 * 
#> Sex           -10.32      17.73   -0.58   0.5608   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 421 on 1996 degrees of freedom
#> Multiple R-squared:  0.00228,    Adjusted R-squared:  0.00128 
#> F-statistic: 2.28 on 2 and 1996 DF,  p-value: 0.103
#> 
#> Variable added: Age  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -838   -509    187    242    614 
#> 
#> Coefficients:
#>             Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept) 2938.621   1498.992    1.96                0.05 .  
#> Year          -0.771      0.750   -1.03                0.30    
#> Sex          -20.014     17.418   -1.15                0.25    
#> Age           -6.093      0.672   -9.06 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 412 on 1995 degrees of freedom
#> Multiple R-squared:  0.0417, Adjusted R-squared:  0.0403 
#> F-statistic: 28.9 on 3 and 1995 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Species  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -910   -501    194    245    623 
#> 
#> Coefficients:
#>             Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 3444.180   1501.479    2.29              0.02190 *  
#> Year          -0.959      0.750   -1.28              0.20072    
#> Sex          -19.307     17.369   -1.11              0.26646    
#> Age           -5.997      0.671   -8.94 < 0.0000000000000002 ***
#> Species       -3.877      1.095   -3.54              0.00041 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 411 on 1994 degrees of freedom
#> Multiple R-squared:  0.0477, Adjusted R-squared:  0.0458 
#> F-statistic:   25 on 4 and 1994 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Time  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time, 
#>     data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -921   -505    192    246    614 
#> 
#> Coefficients:
#>             Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 3424.505   1501.471    2.28              0.02267 *  
#> Year          -0.965      0.750   -1.29              0.19798    
#> Sex          -19.096     17.369   -1.10              0.27171    
#> Age           -5.902      0.676   -8.73 < 0.0000000000000002 ***
#> Species       -3.853      1.095   -3.52              0.00044 ***
#> Time          15.613     13.759    1.13              0.25664    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 411 on 1993 degrees of freedom
#> Multiple R-squared:  0.0483, Adjusted R-squared:  0.0459 
#> F-statistic: 20.2 on 5 and 1993 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Type  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time + 
#>     Type, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -934   -503    192    249    605 
#> 
#> Coefficients:
#>             Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 3829.068   1513.279    2.53              0.01147 *  
#> Year          -1.135      0.754   -1.51              0.13221    
#> Sex          -23.663     17.498   -1.35              0.17644    
#> Age           -5.800      0.678   -8.56 < 0.0000000000000002 ***
#> Species       -4.155      1.104   -3.76              0.00017 ***
#> Time          17.633     13.784    1.28              0.20095    
#> Type         -28.026     13.714   -2.04              0.04111 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 411 on 1992 degrees of freedom
#> Multiple R-squared:  0.0503, Adjusted R-squared:  0.0474 
#> F-statistic: 17.6 on 6 and 1992 DF,  p-value: <0.0000000000000002
  • Here too, every model has the same AIC value of -66818.77. As before, we therefore keep the model containing the most variables.

We estimated the following versions of this first regression to help answer the research question:

  • Overall: all countries in our merged dataset.
  • Focus on the USA.
  • Focus on Australia.
  • Focus on the two countries above.

Each model, depending on its focus, has its own primary factors influencing the occurrence of shark attacks. In the first (Overall) model, Age, Year, Type, and Time influence attacks, with Age and Year highly significant. In the fourth model, focused on both countries, the main factors are Age, Species, and Type. In conclusion, Age, Type, and Time emerge as the strongest predictors, so they strongly influence the occurrence of shark attacks from a statistical point of view. We give less weight to the single-country models, as they are less relevant: analysing more than one country makes more sense, and the two countries with the most recorded attacks also provide the most data, giving more detail on the significant factors. Focusing on the regions with the most attacks also reduces the complexity of our model, which makes the results more relevant.
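The "Variable added: … | AIC: …" traces shown above come from a forward-addition loop. A minimal sketch of that pattern on simulated data (the names mirror the report, but the data here are synthetic, so unlike the report's flat AIC trace, the AIC drops when the one informative predictor enters):

```r
# Forward addition: refit the model after each new predictor and report its AIC.
set.seed(1)
n <- 200
toy <- data.frame(Year = rnorm(n), Age = rnorm(n), Time = rnorm(n))
toy$Attacks <- 2 * toy$Age + rnorm(n)  # Age is the only true predictor here

predictors <- c("Year", "Age", "Time")
current <- character(0)
aics <- numeric(0)
for (p in predictors) {
  current <- c(current, p)
  fit <- lm(reformulate(current, response = "Attacks"), data = toy)
  aics <- c(aics, AIC(fit))
  cat("Variable added:", p, "| AIC:", round(AIC(fit), 2), "\n")
}
```

When the AIC stays flat across additions, as in the report's data, the criterion gives no reason to prefer any nested model over another.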

5.2 RQ2: Which are the factors leading to the fatality of a shark attack?

In this paragraph, we aim to address the research question of what factors contribute to the fatality of shark attacks. The variable we study this time is “Fatality”, a binary variable where 0 denotes no fatality and 1 denotes a fatality. In the EDA we already saw that, fortunately, most shark attacks are not fatal, but there are still cases where these accidents claim lives. This investigation is compelling because identifying these factors can help guide targeted safety protocols and public education initiatives.

The model we use to answer our question is a logistic regression, with ‘Date’, ‘Year’, ‘Age’, and ‘Time’ as predictors.

\[ \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 \]

where:

  1. \(p\) = Probability that a shark attack is fatal
  2. \(X_{\text{1}}\) = Date (month)
  3. \(X_{\text{2}}\) = Year
  4. \(X_{\text{3}}\) = Age
  5. \(X_{\text{4}}\) = Time

(Unlike a linear model, the logistic regression has no additive residual term; randomness enters through the binomial outcome.)

filtered_data4 <- merged_data3 %>%
  filter(ISO_Code %in% c("ZAF", "USA", "AUS"))

filtered_data4$Sex <- factor(filtered_data4$Sex)

model12 <- glm(Fatality ~ Date + Year + Age + Time, 
               data = filtered_data4, 
               family = "binomial")
Regression on the fatality of shark attacks

                          Dependent variable:
                          Fatality
    -----------------------------------------------
    Date                      -0.006    (0.026)
    Year                      -0.030*** (0.006)
    Age                        0.027*** (0.006)
    Time                      -0.029    (0.127)
    Constant                  55.800*** (12.300)
    -----------------------------------------------
    Observations              2,319
    Log Likelihood           -553.000
    Akaike Inf. Crit.         1,115.000
    -----------------------------------------------
    Note: *p<0.1; **p<0.05; ***p<0.01

We will start by analysing the significant coefficients.

The intercept of 55.643 is the estimated log-odds of fatality when all predictors are zero. Since our variable of interest is binary, this log-odds is difficult to interpret directly, and it is not immediately clear how it maps to the likelihood of death. Note also that Year = 0 lies far outside the range of the data, so the intercept has little practical meaning on its own.

The coefficient for ‘Year’ is -0.030, suggesting that the log-odds of a fatal outcome decrease by about 0.03 with each passing year. This trend may be driven by developments in emergency medical response and care, public education and awareness campaigns about shark safety, enhanced beach surveillance and warning systems, or adjustments in recreational activities and behavior near coastal areas.

The ‘Age’ coefficient is 0.027, indicating that for each unit increase in age, the log-odds of fatality increase by 0.0269. One explanation for this correlation may be that older people have less physical strength and agility, which makes it harder for them to flee from or defend against a shark attack. Additionally, older people may be more vulnerable to severe injuries from shark bites. Furthermore, the observed higher risk of death in older age groups may be explained by behavioral factors such as an increased propensity to participate in riskier water activities or an increased amount of time spent in the water.
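Log-odds coefficients like these are easier to read as odds ratios: exp(β) is the multiplicative change in the odds of fatality per unit increase in the predictor. A quick sketch using the rounded coefficients from the table above (not the fitted model object):

```r
# Convert reported log-odds coefficients into odds ratios via exp().
coefs <- c(Year = -0.030, Age = 0.027)
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
# Each additional year of age multiplies the odds of a fatal outcome by roughly
# 1.03 (about a 2.7% increase), holding the other predictors fixed; each later
# calendar year multiplies them by roughly 0.97 (about a 3% decrease).
```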

The last two coefficients, those of Date and Time, are not significant, but let us dive into them anyway!

The ‘Date’ coefficient is -0.006, but with a p-value of 0.85 it is not statistically significant, implying that the month of the shark attack may not be a strong predictor of fatality. This was not a surprise: the three countries we consider experience the seasons at different times of year, depending on their position on the globe. For example, summer in the USA runs from June to September, while in Australia and South Africa it runs from December to February/March. Pooling calendar months across hemispheres therefore blurs any seasonal signal in the data.

Similarly, the ‘Time’ coefficient is small and not statistically significant, indicating that the time of day may not be a predictive factor for shark attack fatalities. This lack of significance suggests that when an attack occurs has little bearing on whether it ends fatally. Intuitively, we expected night attacks to be more fatal, given the difficulty in asking for or receiving help, but apparently this is not the case.

The model’s goodness of fit is assessed through the null and residual deviances, with a lower residual deviance of 1079 on 2190 degrees of freedom compared to the null deviance of 1120.4 on 2194 degrees of freedom. This reduction in deviance indicates that the model explains some of the variability in the data. The Akaike Information Criterion (AIC) of 1089 is a measure of the model’s relative quality, considering both fit and complexity.

5.3 RQ3: Is it true that shark attacks have been increasing with climate change?

The graphs in the section on Exploratory Data Analysis (EDA) show us that the elements impacting climate change are trending upward. Interestingly, there is a comparable rising trend in the incidence of shark attacks. It presents a valid question regarding whether these two developments are correlated and how strong that association is. The problem of climate change is intricate and multidimensional. Three primary variables will be the focus of our investigation: temperature change, sea level rise, and CO2 emissions. Although we recognize that many other factors also contribute to this environmental phenomenon, we will focus on these specific variables and simplify the analysis in this way.

To answer this question, we must add a new column to our dataset indicating the number of shark attacks that occur annually in each country. This allows us to investigate their frequency.

\[ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon \]

where:

  1. \(\hat{Y}\) = Frequency of Shark Attacks
  2. \(X_{\text{1}}\) = CO2 Emissions
  3. \(X_{\text{2}}\) = Temperatures Change
  4. \(X_{\text{3}}\) = Sea Level Change
  5. \(\epsilon\) = Residual Error

# The first thing we do is create a copy of our dataset, so that we can remove information that is not needed here. Indeed, with the na.omit function, we drop all rows from the years
# before 1992. The reason is that we could not find a sea level dataset whose records go back to 1970.

data_RQ3 <- merged_data3
data_RQ3 <- na.omit(data_RQ3) 

str(data_RQ3) # First, we run str() to check that all the variables of interest are num or int, not chr

# Run a correlation matrix to make sure there is no multicollinearity. When we run it, we see
# that all correlations are far from 1 or -1, which is a positive sign.

Before running our regression, it is important to check for any sign of multicollinearity in our data. For this reason, we draw two correlation plots. The first helps us visualize quickly the magnitude of the correlations among the variables analysed: the fact that we do not see any dark blue or dark red circle between two different variables tells us that the correlations are far from the extremes of 1 and -1.

A clearer plot is the second one, which provides the reader with the exact correlation coefficients. The takeaway is that there is no strong linear dependence among the predictors we have chosen: this observation gives us confidence in the validity of our regression analysis and opens the door to running our regression!


subset_data <- data_RQ3[, c("Temperature", "Annual CO2 Emissions", "GMSL_GIA")]
correlation_matrix <- cor(subset_data, use = "complete.obs")
corrplot(correlation_matrix, method = "circle")

# Customized corrplot for the subset
corrplot(
  correlation_matrix,
  method = "ellipse",
  type = "upper",
  tl.col = "black",
  tl.srt = 45,
  addCoef.col = "gray"
)


# We create a new variable 'SharkAttacksCount' representing the frequency of shark attacks per year
count_shark_attacks <- data_RQ3 %>%
  group_by(Year, ISO_Code) %>%
  summarize(SharkAttacksCount = n())


# Then, we merge the aggregated shark attack counts back into the original dataset (merged_data3) by the 'Year' and 'ISO_Code' variables, so that we can run the regression
data_RQ3 <- merge(data_RQ3, count_shark_attacks, by = c("Year", "ISO_Code"), all.x = TRUE)

# FIRST MODEL
filtered_data <- data_RQ3 %>%
  filter(ISO_Code %in% c("USA", "ZAF", "AUS"))

model <- lm(SharkAttacksCount ~ Temperature + GMSL_GIA + `Annual CO2 Emissions`, data = filtered_data)
Regression for shark attacks against climate change factors

                          Dependent variable:
                          SharkAttacksCount
    -----------------------------------------------
    Temperature                4.34000*** (0.55100)
    GMSL_GIA                   0.28600*** (0.01130)
    Annual CO2 Emissions       0.00000*** (0.00000)
    Constant                   8.00000*** (0.56200)
    -----------------------------------------------
    Observations               1,717
    R2                         0.79200
    Adjusted R2                0.79100
    Residual Std. Error        8.83000 (df = 1713)
    F Statistic                2,171.00000*** (df = 3; 1713)
    -----------------------------------------------
    Note: *p<0.1; **p<0.05; ***p<0.01

The results of our regression are really interesting to analyse! The statistical significance of the coefficients related to temperature change, sea level change, and CO2 emissions indicates their important contribution to our predictive model. The complex interaction between environmental dynamics and shark activity is highlighted by the positive correlation found between the annual frequency of shark attacks and these three climate change parameters.

The constant term denotes the baseline frequency of shark attacks when all environmental factors are held at zero, and it is equal to 8.00035. This means that even with no detected climate change, we would still expect a positive number of shark attacks. The interpretation is fairly clear: shark attacks are obviously not caused by climate change per se.

Our second finding is that the temperature change beta coefficient is 4.34176, with the significance level indicated by three asterisks (***). This highlights a strong positive association between temperature changes and the number of shark attacks. Similarly, the sea level change beta coefficient, also marked with three asterisks and registering at 0.28554, indicates a significant positive association between rising sea levels and an increase in shark attacks. Finally, the beta coefficient for CO2 emissions is very close to zero while remaining positive, telling us that an increase in CO2 emissions leads to a very small increase in shark attacks.

Finally, we also note that \(R^2\) is equal to 0.7917, meaning that, in this context, our predictors explain 79.17% of the variation in the annual frequency of shark attacks.
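As a reminder of what that R² figure measures, it equals 1 − SSR/SST: the share of the outcome's total variation captured by the fitted model. A minimal check on simulated data (not the project's dataset):

```r
# Verify that summary()'s R-squared matches the 1 - SSR/SST definition.
set.seed(7)
x <- rnorm(100)
y <- 3 * x + rnorm(100)           # toy outcome with a known linear signal
fit <- lm(y ~ x)

ssr <- sum(residuals(fit)^2)      # residual sum of squares
sst <- sum((y - mean(y))^2)       # total sum of squares
r2_manual <- 1 - ssr / sst

all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```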

6. Conclusion

7. References

CBS News. (2010, October 24). Five Most Dangerous Sharks to Humans. https://www.cbsnews.com/pictures/five-most-dangerous-sharks-to-humans/

Midway, S. R., Wagner, T., & Burgess, G. H. (2019). Trends in global shark attacks. PLOS ONE, 14(2), e0211049. https://doi.org/10.1371/journal.pone.0211049

West, J. G. (2011). Changing patterns of shark attacks in Australian waters. Marine and Freshwater Research, 62(6), 744. https://doi.org/10.1071/mf10181

Statista. (2023, September 12). World’s biggest CO₂ emitters 2021. https://www.statista.com/statistics/271748/the-largest-emitters-of-co2-in-the-world/